Estimating Demographics of Custom Spatial Features

Accessing U.S. Census Bureau Data & Calculating Weighted Averages with Areal- and Population-Weighted Interpolation

1 Background

Note

For comments, suggestions, corrections, or questions on anything below, contact david.altare@waterboards.ca.gov, or open an issue on github.

Warning

This is a draft / work in progress – some parts are still under development, and existing parts may change.

This document provides an example of how to use tools available from the R programming language (R Core Team 2023) to estimate characteristics of any given target spatial area(s) (e.g., neighborhoods, project boundaries, water supplier service areas, etc.) based on data from a source dataset containing the characteristic data of interest (e.g., census data, CalEnvrioScreen scores, etc.), especially when the boundaries of the source and target areas overlap but don’t necessarily align with each other. It also provides some brief background on the various types of data available from the U.S Census Bureau, and links to a few places to find more in-depth information.

This particular example estimates demographic characteristics of community water systems in the Sacramento County area (the target dataset). It uses the tidycensus R package (Walker and Herman 2023) to access selected demographic data from the U.S. Census Bureau (the source dataset) for census units whose spatial extent covers those water systems’ service areas, then uses the sf package (Pebesma and Bivand 2023) package (for working with spatial data) and the tidyverse collection of packages (Wickham et al. 2019) (for general data cleaning and transformation) to estimate some demographic characteristics of each water system based on that census data. It also uses the areal R package (Prener et al. 2019) to check some of the results, and as general guidance on the principles and techniques for implementing areal interpolation.

This example is just intended to be a simplified demonstration of a possible workflow. For a real analysis, additional steps and considerations – that may not be covered here – may be needed to deal with data inconsistencies (e.g., missing or incomplete data), required level of precision and acceptable assumptions (e.g. more fine-grained datasets or more sophisticated techniques could be used to estimate/model population distributions), or other project-specific issues that might arise.

2 Setup

The code block below loads required packages for this analysis, and sets some user-defined options and defaults. If they aren’t already installed on your computer, you can install them with the R command install.packages('package-name') (and replace package-name with the name of the package you want to install).

# packages ----
library(tidycensus)
library(tigris)
library(tidyverse)
library(sf)
library(areal)
library(janitor)
library(here)
library(units)
library(knitr)
library(kableExtra)
library(tmap)
library(patchwork)
library(scales)
library(digest)
library(mapview)
library(biscale)
library(cowplot)
library(glue)
library(ggtext)

# conflicts ----
library(conflicted)
conflicts_prefer(dplyr::filter)

# options ----
options(scipen = 999) # turn off scientific notation
options(tigris_use_cache = TRUE) # use data caching for tigris

# reference system ----
## set common projected coordinate reference system used throughout this analysis
crs_projected <- 3310 # see: https://epsg.io/3310

3 Census Data Overview

This section provides some brief background on the various types of data available from the U.S. Census Bureau (a later section - Section 5 - demonstrates how to retrieve data from the U.S. Census Bureau using the tidycensus R package). Most of the information covered here comes from the book Analyzing US Census Data: Methods, Maps, and Models in R, which is a great source of information if you’d like more detail about any of the topics below (Walker 2023b).

Note

If you’re already familiar with Census data and want to skip this overview, go directly to the next section: Section 4

Different census products/surveys contain data on different variables, at different geographic scales, over varying periods of time, and with varying levels of certainty. Therefore, there are a number of judgement calls to make when determining which type of census data to use for an analysis – e.g., which data product to use (Decennial Census or American Community Survey), which geographic scale to use (e.g., Block, Block Group, Tract, etc.), what time frame to use, which variables to assess, etc.

More detailed information about U.S. Census Bureau’s data products and other topics mentioned below is available here.

3.1 Census Unit Geography / Hierarchy

Publicly available datasets from the U.S Census Bureau generally consist of individual survey responses aggregated to defined census units (e.g., census tracts) that cover varying geographic scales. Some of these units are nested and can be neatly aggregated (e.g., each census tract is composed of a collection of block groups, and each block group is composed of a collection of blocks), while other census units are outside this hierarchy (e.g., Zip Code Tabulation Areas don’t coincide with any other census unit). Figure 1 shows the relationship of all of the various census units.

Commonly used census statistical units like tracts and block groups have target population size ranges, and can be adjusted every 10 years (with the decennial census) based on population changes. For example, all ACS 5-year datasets prior to 2020 use the 2010 boundaries for tracts, block groups, and blocks, and all ACS 5-year datasets from 2020 onward (presumably through 2029) use the 2020 boundaries for those units. Census tracts are generally around 4,000 people, with a range from about 1,200 to 8,000, and block groups generally contain 600 to 3,000 people. Blocks are the smallest census units, and are “areas bounded by visible features, such as streets, roads, streams, and railroad tracks, and by nonvisible boundaries, such as selected property lines and city, township, school district, and county limits and short line-of-sight extensions of streets and roads”. For example, a census block may be “a city block bounded on all sides by streets”, while “blocks in suburban and rural areas may be larger, more irregular in shape, and bounded by a variety of features, such as roads, streams, and transmission lines”.

Caution

Census boundaries can change over time. Commonly used statistical units like tracts, block groups, and blocks tend to be revised every 10 years (with the decennial census), so it’s important to use a census boundary dataset that matches the version of the census demographic data you’re retrieving; otherwise, the demographic data may not match geographic areas in your boundary dataset. In some cases, a census unit that exists in a given year of the census data may not exist at all in a different year’s dataset, because census units can be split or merged when boundaries are revised.

For more information, see here or here or here or here.

For a list of the different geographic units available for each of the different census products/surveys (see Section 3.2) that can be accessed via the tidycensus package, go here.

Figure 1: Census Unit Hierarchies

3.2 Census Datasets / Surveys

The Decennial Census is conducted every 10 years, and is intended to provide a complete count of the US population and assist with political redistricting. As a result, it collects a relatively limited set of basic demographic data, but (should) provide a high degree of precision (i.e., in general it should provide exact counts). It is available for geographic units down to the census block (the smallest census unit available – see Section 3.1). For information about existing and planned future releases of 2020 census data products, go here.

The American Community Survey (ACS) provides a much larger array of demographic information than the Decennial Census, and is updated more frequently. The ACS is based on a sample of the population (rather than a count of the entire population, as in the Decennial Census), so it represents estimated values rather than precise counts; therefore, each data point is available as an estimate (typically labeled with an “E” in census variable codes, which are discussed in Section 3.3 ) along with an associated margin of error (typically labeled with “M” or “MOE” in census variable codes) around its estimated value.

The ACS is available in two formats. The 5-year ACS is a rolling average of 5 years of data (e.g., the 2021 5-year ACS dataset is an average of the ACS data from 2017 through 2021), and is generally available for geographic units down to the census block group (though some 5-year ACS data may only be available at less granular levels). The 1-year ACS provides data for a single year, and is only available for geographies with population greater than 65,000 (e.g., large cities and counties). Therefore, only the 5-year ACS will be useful for any analysis at a relatively fine scale (e.g., anything that requires data at or more detailed than the census tract level, or any analysis that considers smaller counties/cities – by definition, census tracts always contain significantly fewer than 65,000 people).

In addition to the Decennial Census and ACS data, a number of other census data products/surveys are also available. For example, see the censusapi R package (here or here) for access to over 300 census API endpoints. For historical census data, see the discussion here on using NHGIS, IPUMS, and the ipumsr package.

3.3 Census Variables / Codes

Each census product collects data for many different demographic variables, and each variable is generally associated with an identifier code. In order to access census data programmatically, you often need to know the code associated with each variable of interest. When determining which variables to use, you need to consider what census product contains those variables (see Section 3.2) and how they differ in terms of time frame, precision, spatial granularity (see Section 3.1), etc.

The tidycensus package offers a convenient generic way to search for variables across different census products using the load_variables() function, as described here.

The following websites may also be helpful for exploring the various census data products and finding the variable names and codes they contain:

4 Target Data Boundaries (Water Systems)

In this section, we’ll get the service area boundaries for Community Water Systems within the Sacramento County area. This will serve as the target dataset – i.e., the set of areas which we’ll be estimating the characteristics of – and will also be used to specify the geographic areas of the census data we want to retrieve. We’ll also get a dataset of county boundaries which overlap the water service areas in this study, which can also help with specifying what census data to access and/or be used to make maps and visualizations.

4.1 Read Water System Data

In this case, we’ll get the water system dataset from a shapefile that’s saved locally, then transform that dataset into a common coordinate reference system for mapping and analysis (which is defined above in the variable crs_projected).

This water system dataset comes from the California Drinking Water System Area Boundaries dataset. For this example, the dataset has been pre-filtered for systems within Sacramento County (by selecting records where the COUNTY field is “SACRAMENTO”) and for Community Water Systems (by selecting records where the STATE_CLAS field is “COMMUNITY”). Some un-needed fields have also been dropped, remaining fields have been re-orderd.

water_systems_sac <- st_read(here('02_data_input', 
                                  'water_supplier_boundaries_sac', 
                                  'System_Area_Boundary_Layer_Sac.shp')) %>% 
    st_transform(crs_projected) # transform to common coordinate system

We can use the glimpse function (below) to take get a sense of what type of information is available in the water system dataset and how it’s structured.

glimpse(water_systems_sac)
Rows: 62
Columns: 12
$ WATER_SY_1 <chr> "HOOD WATER MAINTENCE DIST [SWS]", "MC CLELLAN MHP", "MAGNO…
$ WATER_SYST <chr> "CA3400101", "CA3400179", "CA3400130", "CA3400135", "CA3400…
$ GLOBALID   <chr> "{36268DB3-9DB2-4305-A85A-2C3A85F20F34}", "{E3BF3C3E-D516-4…
$ BOUNDARY_T <chr> "Water Service Area", "Water Service Area", "Water Service …
$ OWNER_TYPE <chr> "L", "P", "P", "P", "P", "P", "P", "P", "P", "P", "P", "P",…
$ COUNTY     <chr> "SACRAMENTO", "SACRAMENTO", "SACRAMENTO", "SACRAMENTO", "SA…
$ REGULATING <chr> "LPA64 - SACRAMENTO COUNTY", "LPA64 - SACRAMENTO COUNTY", "…
$ FEDERAL_CL <chr> "COMMUNITY", "COMMUNITY", "COMMUNITY", "COMMUNITY", "COMMUN…
$ STATE_CLAS <chr> "COMMUNITY", "COMMUNITY", "COMMUNITY", "COMMUNITY", "COMMUN…
$ SERVICE_CO <dbl> 82, 199, 34, 64, 128, 83, 28, 50, 164, 5684, 14798, 115, 33…
$ POPULATION <dbl> 100, 700, 40, 150, 256, 150, 32, 100, 350, 18005, 44928, 20…
$ geometry   <MULTIPOLYGON [m]> MULTIPOLYGON (((-132703 403..., MULTIPOLYGON (…

Note that this dataset already includes a POPULATION variable that indicates the population served by each water system, which we renamed to water_system_population_reported above (note: I’m not exactly how the data in this variable is derived). However, for this analysis we’ll be making our own estimate of the population within each system’s service area based on U.S. Census Bureau data and the spatial representation of the system boundaries. Given the uncertainty in how the reported population data was derived (including potential temporal differences), the population estimates produced here will likely will not exactly match the reported population data; but, the reported population data may serve as a useful check to make sure our estimates are reasonable.

To make the water system data easier to work with, we can make some more descriptive field names (note that while it’s redundant, we’re using the prefix water_system_ for all field names to distinguish data types when joining this data with other datasets later).

water_systems_sac <- water_systems_sac %>% 
    rename(water_system_name = WATER_SY_1, 
           water_system_number = WATER_SYST,
           water_system_id  = GLOBALID,
           water_system_boundary_type = BOUNDARY_T,
           water_system_owner_type  = OWNER_TYPE,
           water_system_county  = COUNTY,
           water_system_regulating_agency = REGULATING,
           water_system_federal_class = FEDERAL_CL,
           water_system_state_class = STATE_CLAS,
           water_system_service_connections = SERVICE_CO,
           water_system_population_reported = POPULATION)

Here’s a view of the structure of the revised dataset:

glimpse(water_systems_sac)
Rows: 62
Columns: 12
$ water_system_name                <chr> "HOOD WATER MAINTENCE DIST [SWS]", "M…
$ water_system_number              <chr> "CA3400101", "CA3400179", "CA3400130"…
$ water_system_id                  <chr> "{36268DB3-9DB2-4305-A85A-2C3A85F20F3…
$ water_system_boundary_type       <chr> "Water Service Area", "Water Service …
$ water_system_owner_type          <chr> "L", "P", "P", "P", "P", "P", "P", "P…
$ water_system_county              <chr> "SACRAMENTO", "SACRAMENTO", "SACRAMEN…
$ water_system_regulating_agency   <chr> "LPA64 - SACRAMENTO COUNTY", "LPA64 -…
$ water_system_federal_class       <chr> "COMMUNITY", "COMMUNITY", "COMMUNITY"…
$ water_system_state_class         <chr> "COMMUNITY", "COMMUNITY", "COMMUNITY"…
$ water_system_service_connections <dbl> 82, 199, 34, 64, 128, 83, 28, 50, 164…
$ water_system_population_reported <dbl> 100, 700, 40, 150, 256, 150, 32, 100,…
$ geometry                         <MULTIPOLYGON [m]> MULTIPOLYGON (((-132703 …

4.1.1 Alternative Data Retrieval Method

Reading in data from a shapefile is shown above because it’s likely one of the more common ways that users will access their target boundary data. However, depending on the dataset, there may be other ways to access the data. For example, the code chunk below demonstrates an alternative – using the arcgislayers package (Parry 2023) – that connects directly to the source dataset (to retrieve the most recent version) and applies the filters needed to reproduce the dataset in the System_Area_Boundary_Layer_Sac.shp file. Also, note that storing data in formats other than the common shapefile format – such as the geopackage format – can have some advantages (for example, see here).

# load arcgislayers package (see: https://r.esri.com/arcgislayers/index.html)
install.packages('pak') # only needed if the pak package is not already installed
pak::pkg_install("R-ArcGIS/arcgislayers", dependencies = TRUE)

library(arcgislayers)

# define link to data source
url_feature <- 'https://gispublic.waterboards.ca.gov/portalserver/rest/services/Drinking_Water/California_Drinking_water_system_numberem_Area_Boundaries/FeatureServer/0'

# connect to data source
water_systems_feature_layer <- arc_open(url_feature)

# download and filter data from source
water_systems_sac <- arc_select(
    water_systems_feature_layer,
    # apply filters
    where = "COUNTY = 'SACRAMENTO' AND STATE_CLASSIFICATION = 'COMMUNITY'",
    # select fields
    fields = c('WATER_SYSTEM_NAME', 'WATER_SYSTEM_NUMBER', 'GLOBALID',
               'BOUNDARY_TYPE', 'OWNER_TYPE_CODE', 'COUNTY',
               'REGULATING_AGENCY', 'FEDERAL_CLASSIFICATION', 
               'STATE_CLASSIFICATION', 'SERVICE_CONNECTIONS', 'POPULATION')) %>%
    # transform to common coordinate system
    st_transform(crs_projected) %>%
    # rename fields
    rename(water_system_name = WATER_SYSTEM_NAME,
           water_system_number = WATER_SYSTEM_NUMBER,
           water_system_id = GLOBALID,
           water_system_boundary_type = BOUNDARY_TYPE,
           water_system_owner_type = OWNER_TYPE_CODE,
           water_system_county = COUNTY,
           water_system_regulating_agency = REGULATING_AGENCY,
           water_system_federal_class = FEDERAL_CLASSIFICATION,
           water_system_state_class = STATE_CLASSIFICATION,
           water_system_service_connections = SERVICE_CONNECTIONS,
           water_system_population_reported = POPULATION)

4.2 Get County Boundaries

When accessing census data using the tidycensus R package as shown below (in Section 5), it’s often useful (though not strictly required) to know which counties overlap the target dataset (note that, even though the dataset is filtered for systems in Sacramento county, there are some systems whose boundaries extend into neighboring counties). County boundaries may also be useful for making maps in later stages of the analysis. We can get a dataset of county boundaries in California from the TIGER dataset, which can be accessed with R using the tigris R package (Walker 2023a).

counties_ca <- counties(state = 'CA', 
                        cb = TRUE) %>% # simplified
    st_transform(crs_projected) # transform to common coordinate system

Then, we can get a list of counties that overlap with the boundaries of the Sacramento area community water systems obtained above.

counties_overlap <- counties_ca %>% 
    st_filter(water_systems_sac, 
              .predicate = st_overlaps)

counties_list <- counties_overlap %>% pull(NAME)

The counties in the counties_list variable are: San Joaquin, Yolo, Placer, Sacramento.

4.3 Plot Target Data

Figure 2 shows the water systems and county boundaries in an interactive map.

mapview(counties_overlap, 
        alpha.regions = 0, 
        zcol = 'NAME', 
        layer.name = 'County', 
        legend = FALSE) + 
    mapview(water_systems_sac, 
            zcol = 'water_system_name', 
            layer.name = 'Water System', 
            legend = FALSE)
Figure 2: Selected water systems (with county boundaries for reference).

5 Accessing Census Data

The following sections demonstrate how to retrieve census data from the Decennial Census and the ACS using the tidycensus R package.

In order to use the tidycensus R package, you’ll need to obtain a personal API key from the US Census Bureau (which is free and available to anyone) by signing up here: http://api.census.gov/data/key_signup.html. Once you have your API key, you’ll need to register it in R by entering the command census_api_key(key = "YOUR API KEY", install = TRUE) in the console. Note that the install = TRUE argument means that the key is saved for all future R sessions, so you’ll only need to run that command once on your computer (rather than including it in your scripts). Alternatively, you could save your key to an environment variable and retrieve it using Sys.getenv(). Either way will help you avoid the possibility of entering your API key into any scripts that could be shared publicly.

Caution

Because the boundaries of census units (e.g., tracts, block groups, blocks, etc) can change over time, it’s important to make sure that the version (year) of the census data you’re retrieving matches the version of the census boundary dataset you’re using. The methods shown below retrieve the census boundary dataset together with the census demographic data, which ensures that this won’t be a potential problem. However, if you use a different workflow that retrieves the geographic boundaries and demographic data via separate processes, you should ensure that the versions are consistent.

Before downloading the census data, we can create an object that we can use to filter our requests to the census API so that they will only return census units that overlap with our target areas (the object will be passed to the filter_by argument of the get_decennial function below). Note that this isn’t strictly necessary (you could also apply the filter after making the API request), but may helpful to speed the query and reduce memory usage, especially in the case of large queries.

Note 1

At the time of this writing, the filter_by argument of the tidycensus get_decennial and get_acs functions is fairly new, and not yet included in the official documentation.

Also, the filter_by argument is optional, and only appears to accept a simple features (sf) object with a single row / feature (e.g., a single water system), and will not accept an sf object with multiple rows / features. The process below attempts to work around this constraint by joining all of the selected water systems into a single multi-part polygon (i.e., an sf object with a single row). However, if you only want to retrieve data for census units that overlap a single target area (e.g., a single water system), you can skip this step.

Listing 1: Create object for filtering the API query
water_systems_filter <- water_systems_sac %>% 
    st_union() %>% 
    st_as_sf()

5.1 Decennial Census

This section retrieves census data from the Decennial Census, using the get_decennial function from the tidycensus package. As of this writing, the most recent version of the decennial census data available is from 2020, and we can set that as a variable below.

# set year
decennial_year <- 2020

Next, we can define the list of demographic variables we’d like to retrieve tabular data for, by saving the census variables we want in the census_vars_decennial object (see Section 3.3 for more information about how to discover variables of interest and find their associated codes). Note that here we’re providing descriptive names associated with each variable code, which makes the data easier to work with later, but isn’t strictly necessary (i.e., you could just supply the variable codes alone).

# define variables to pull from the decennial census
census_vars_decennial <- c(
    'population_hispanic_or_latino_count' = 'P2_002N', # Total Hispanic or Latino
    'population_white_count' = 'P2_005N', # White (Not Hispanic or Latino)
    'population_black_or_african_american_count' = 'P2_006N', # Black or African American (Not Hispanic or Latino)
    'population_native_american_or_alaska_native_count' = 'P2_007N', # American Indian and Alaska Native (Not Hispanic or Latino)
    'population_asian_count' = 'P2_008N', # Asian (Not Hispanic or Latino)
    'population_pacific_islander_count' = 'P2_009N', # Native Hawaiian and Other Pacific Islander (Not Hispanic or Latino)
    'population_other_count' = 'P2_010N', # Some other race (Not Hispanic or Latino)
    'population_multiple_count' = 'P2_011N', # Two or more races (Not Hispanic or Latino)
    'population_total_count' = 'P2_001N'
)

Now, we can make the data request, using the get_decennial function, which accepts several arguments that specify exactly what data to return.

For this example we’re getting data at the ‘Block’ level (with the geography = 'block' argument) for the demographic variables defined above in the census_vars_decennial object (which is passed to the variables argument). As noted above, block-level data is the most granular level of spatial data available, and should provide the best results when estimating demographics for areas whose boundaries don’t align with census unit boundaries. However, depending on the use case, it may require too much time and computational resources to use the most granular spatial data, and may not be necessary to obtain a reasonable estimate. Also, keep in mind that block-level data may not be available for all variables, and some variables may only be available at less granular spatial scales (like block groups or tracts).

In addition to the tabular data associated with the demographic variables in our list, we’ll also get the spatial data – i.e., the boundaries of the census blocks – by setting the geometry = TRUE argument. When we do this, the tabular demographic data is pre-joined to the spatial data, so the API request returns a single dataset with both the spatial and attribute (demographic) data combined.

Note

The tidycensus package generally returns the Census Bureau’s cartographic boundary shapefiles by default (as opposed to the core TIGER/Line shapefiles, which is the default format returned by the tigris R package). The default cartographic boundary shapefiles are pre-clipped to the US coastline, and are smaller/faster to process (alternatively you can use cb = FALSE to get the core TIGER/Line data) (see here). So the default spatial data returned by tidycensus may be somewhat different than the default spatial data returned by the tigris package, but in general I find it’s best to use the default tidycensus spatial data.

Warning

At the block level, it appears that tidycensus only returns the more detailed core TIGER/Line shapefiles (i.e., they are identical to the default block-level geographic data returned by tigris). In some cases, that can create minor inconsistencies when working with both blocks and block groups and using the default geographies.

We also narrow down the search parameters geographically by specifying the state (with state = 'CA') and counties (county = counties_list) we’re seeking data for, and provide an object to the filter_by argument which filters the data returned so that it only includes census units that overlap with our target areas (see Note 1 above for more information).

Note

Supplying a list of counties may not be strictly necessary, especially in cases where you supply the optional filter_by argument. However, especially when working with granular data like blocks, supplying the county argument seems to greatly speed the API request.

Also, while by default the tidycensus package returns data in long/tidy format, we’re getting the data in wide format for this example (by specifying output = 'wide') because it’ll be easier to work with for the interpolation method described below to estimate demographics for non-census geographies.

Listing 2: Retrieve decennial census data
# get census data
census_data_decennial <- get_decennial(geography = 'block', # can be 'block', 'block group', 'tract', 'county', etc.
                                       state = 'CA', 
                                       county = counties_list,
                                       filter_by = water_systems_filter,
                                       year = decennial_year,
                                       variables = census_vars_decennial,
                                       output = 'wide', # can be 'wide' or 'tidy'
                                       geometry = TRUE,
                                       cache_table = TRUE) %>% 
    st_transform(crs_projected) # convert to common coordinate system

The output is an sf object (i.e., a dataframe-like object that also includes spatial data), in wide format, where each row represents a census unit, and the population of each racial/ethnic group is reported in a separate column. Here’s a view of the contents and structure of the Decennial Census data that’s returned:

glimpse(census_data_decennial)
Rows: 17,745
Columns: 12
$ GEOID                                             <chr> "060670019003011", "…
$ NAME                                              <chr> "Block 3011, Block G…
$ population_hispanic_or_latino_count               <dbl> 4, 6, 8, 11, 1, 14, …
$ population_white_count                            <dbl> 20, 4, 167, 70, 86, …
$ population_black_or_african_american_count        <dbl> 2, 2, 0, 8, 9, 18, 0…
$ population_native_american_or_alaska_native_count <dbl> 0, 0, 0, 0, 0, 0, 0,…
$ population_asian_count                            <dbl> 19, 5, 2, 1, 23, 8, …
$ population_pacific_islander_count                 <dbl> 0, 0, 0, 0, 0, 0, 0,…
$ population_other_count                            <dbl> 0, 0, 0, 0, 0, 0, 0,…
$ population_multiple_count                         <dbl> 8, 3, 4, 10, 5, 10, …
$ population_total_count                            <dbl> 53, 20, 181, 100, 12…
$ geometry                                          <POLYGON [m]> POLYGON ((-1…

5.2 American Community Survey (ACS)

To get data from the ACS, you can use the get_acs() function, which is very similar to the get_decennial() function used above. As of this writing, the most recent version of the 5-year ACS data available is the 2018-2022 ACS, and we can set that as a variable below (which makes it easier to update this document in future years).

# set year
acs_year <- 2022

However, since the ACS data contains data on a much broader set of socio-economic metrics, the requested data includes a greatly expanded list of variables, defined in the census_vars_acs object (see Section 3.3 for more information about how to discover variables of interest and find their associated codes). As above, we can provide descriptive names associated with each variable code, which makes the data easier to work with later, but isn’t strictly necessary (i.e., you could just supply the variable codes alone). Note that the use of prefixes (like population_ or households_) and suffixes (like _count) is intentional – those will be used later as part of the calculation process.

# define variables to pull from the ACS
census_vars_acs <- c(
    # --- population variables ---
    'population_total_count' = 'B01003_001',
    'population_hispanic_or_latino_count' = 'B03002_012', # Total Hispanic or Latino
    'population_white_count' = 'B03002_003', # White (Not Hispanic or Latino)
    'population_black_or_african_american_count' = 'B03002_004', # Black or African American (Not Hispanic or Latino)
    'population_native_american_or_alaska_native_count' = 'B03002_005', # American Indian and Alaska Native (Not Hispanic or Latino)
    'population_asian_count' = 'B03002_006', # Asian (Not Hispanic or Latino)
    'population_pacific_islander_count' = 'B03002_007', # Native Hawaiian and Other Pacific Islander (Not Hispanic or Latino)
    'population_other_count' = 'B03002_008', # Some other race (Not Hispanic or Latino)
    'population_multiple_count' = 'B03002_009', # Two or more races (Not Hispanic or Latino)
    
    # --- poverty variables ---
    'poverty_total_assessed_count' = 'B17021_001', # also available from 'B17020_001' (at the tract level only). Total population for whom poverty status is determined. Poverty status was determined for all people except institutionalized people, people in military group quarters, people in college dormitories, and unrelated individuals under 15 years old. These groups were excluded from the numerator and denominator when calculating poverty rates.
    'poverty_below_level_count' = 'B17021_002', # also available from 'B17020_002' (at the tract level only). Population whose income in the past 12 months is below federal poverty level. A family and every individual in it are considered to be in poverty if the family's total income is less than the dollar value of a threshold that varies depending upon size of family, number of children, & age of householder (for 1- & 2- person households). Income is the sum of wage/salary income; net self-employment income; interest/dividends/net rental/royalty income/income from estates & trusts; Social Security/Railroad Retirement income; Supplemental Security Income (SSI); public assistance/welfare payments; retirement/survivor/disability pensions; & all other income.
    'poverty_above_level_count' = 'B17021_019', # also available from 'B17020_010' (at the tract level only). Population whose income in the past 12 months is at or above federal poverty level. A family and every individual in it are considered to be in poverty if the family's total income is less than the dollar value of a threshold that varies depending upon size of family, number of children, & age of householder (for 1- & 2- person households). Income is the sum of wage/salary income; net self-employment income; interest/dividends/net rental/royalty income/income from estates & trusts; Social Security/Railroad Retirement income; Supplemental Security Income (SSI); public assistance/welfare payments; retirement/survivor/disability pensions; & all other income.
    
    # --- household variables ---
    'households_count' = 'B19001_001', # also available from variable 'B19053_001'. A household includes all the people who occupy a housing unit - a house, an apartment, a mobile home, a group of rooms, or a single room that is occupied. People not living in households are classified as living in group quarters.
    'average_household_size' = 'B25010_001', # A measure obtained by dividing the number of people living in occupied housing units by the total number of occupied housing units. This measure is rounded to the nearest hundredth.
    
    # --- household income variables ---
    'median_household_income' = 'B19013_001', # also available from 'B19019_001' (at the tract level only). Income in the past 12 months is the sum of wage or salary income; net self-employment income; interest, dividends, or net rental or royalty income or income from estates and trusts; Social Security or Railroad Retirement income; Supplemental Security Income (SSI); public assistance or welfare payments; retirement, survivor, or disability pensions; and all other income.
    'households_income_below_10k_count' = 'B19001_002', # count of households with income below $10,000 
    'households_income_10k_15k_count' = 'B19001_003', # count of households with income $10,000 to $15,000 
    'households_income_15k_20k_count' = 'B19001_004', 
    'households_income_20k_25k_count' = 'B19001_005', 
    'households_income_25k_30k_count' = 'B19001_006', 
    'households_income_30k_35k_count' = 'B19001_007', 
    'households_income_35k_40k_count' = 'B19001_008', 
    'households_income_40k_45k_count' = 'B19001_009', 
    'households_income_45k_50k_count' = 'B19001_010', 
    'households_income_50k_60k_count' = 'B19001_011', 
    'households_income_60k_75k_count' = 'B19001_012', 
    'households_income_75k_100k_count' = 'B19001_013', 
    'households_income_100k_125k_count' = 'B19001_014', 
    'households_income_125k_150k_count' = 'B19001_015', 
    'households_income_150k_200k_count' = 'B19001_016',
    'households_income_above_200k_count' = 'B19001_017', # count of households with income above $200,000
    
    # --- housing costs variables (% of household income) ---
    # Housing Costs as a Percentage of Household Income in the past 12 months - NOTE: THIS TABLE IS NEW FOR THE 2022 ACS, AND WON'T BE AVAILABLE FOR PREVIOUS YEARS - Table B25140 shows the count of households paying more than 30% of their income towards housing costs broken out by three tenure categories (owned with a mortgage, owned without a mortgage, and rented). The table also shows the number of households paying more than 50% of their income toward housing costs.
    # 'households_count' = 'B25140_001', 
    'households_mortgage_total_count' = 'B25140_002',
    'households_mortgage_housing_costs_over30pct_count' = 'B25140_003',
    'households_mortgage_housing_costs_over50pct_count' = 'B25140_004',
    'households_no_mortgage_total_count' = 'B25140_006',
    'households_no_mortgage_housing_costs_over30pct_count' = 'B25140_007',
    'households_no_mortgage_housing_costs_over50pct_count' = 'B25140_008',
    'households_rent_total_count' = 'B25140_010',
    'households_rent_housing_costs_over30pct_count' = 'B25140_011',
    'households_rent_housing_costs_over50pct_count' = 'B25140_012',
    
    # --- other income / economic variables ---
    'per_capita_income' = 'B19301_001' # note: per capita income by race (at block group level) available in table B19301I
)

Finally, we can make the data request, using the get_acs function, which is very similar to the get_decennial function described above ( Section 5.1). However, for this example we’re getting data at the ‘Block Group’ level (with the geography = 'block group' argument), which is the most granular level of spatial data available for ACS data. But, keep in mind that block group-level data may not be available for all variables, and some variables may only be available at less granular spatial scales (like tracts). Note that the water_systems_filter object supplied to the filter_by argument was created above in Listing 1 (and see Note 1 above for more information about this argument).

Listing 3: Retrieve ACS data
# get census data
census_data_acs <- get_acs(geography = 'block group',
                           state = 'CA', 
                           county = counties_list,
                           filter_by = water_systems_filter,
                           year = acs_year,
                           survey = 'acs5',
                           variables = census_vars_acs, 
                           output = 'wide', # can be 'wide' or 'tidy'
                           geometry = TRUE,
                           cache_table = TRUE) %>% 
    st_transform(crs_projected) # convert to common coordinate system

As above, the output is an sf object (i.e., a dataframe-like object that also includes spatial data), in wide format, where each row represents a census unit, and the each demographic variable is reported in a separate column. Here’s a view of the contents and structure of the 2022 5-year ACS data that’s returned (only the first few fields are shown):

glimpse(census_data_acs[,1:20])
Rows: 1,054
Columns: 21
$ GEOID                                              <chr> "060670081451", "06…
$ NAME                                               <chr> "Block Group 1; Cen…
$ population_total_countE                            <dbl> 1768, 1881, 1098, 2…
$ population_total_countM                            <dbl> 520, 585, 395, 583,…
$ population_hispanic_or_latino_countE               <dbl> 38, 327, 376, 782, …
$ population_hispanic_or_latino_countM               <dbl> 59, 298, 280, 315, …
$ population_white_countE                            <dbl> 1627, 1337, 293, 18…
$ population_white_countM                            <dbl> 521, 475, 191, 460,…
$ population_black_or_african_american_countE        <dbl> 0, 1, 272, 26, 351,…
$ population_black_or_african_american_countM        <dbl> 13, 3, 251, 38, 334…
$ population_native_american_or_alaska_native_countE <dbl> 41, 0, 0, 26, 0, 0,…
$ population_native_american_or_alaska_native_countM <dbl> 58, 13, 13, 42, 13,…
$ population_asian_countE                            <dbl> 45, 0, 105, 58, 144…
$ population_asian_countM                            <dbl> 71, 13, 116, 66, 18…
$ population_pacific_islander_countE                 <dbl> 0, 98, 0, 0, 27, 13…
$ population_pacific_islander_countM                 <dbl> 13, 98, 13, 13, 50,…
$ population_other_countE                            <dbl> 0, 0, 39, 0, 0, 0, …
$ population_other_countM                            <dbl> 13, 13, 63, 13, 13,…
$ population_multiple_countE                         <dbl> 17, 118, 13, 39, 15…
$ population_multiple_countM                         <dbl> 27, 125, 20, 57, 25…
$ geometry                                           <POLYGON [m]> POLYGON ((-…

Note that the dataset that’s returned includes fields corresponding to Margin of Error (MOE) for each variable we’ve requested (these are the fields that end an M – e.g., “population_total_countM”), since, as noted above in Section 3.2 , the ACS is based on a sample of the population and reports estimated values.

For further analysis, we may want to get the statewide data as a baseline for comparison (this could also be done for other scales, like the county level). We can use a similar process to get that data and clean/format it to match the more detailed data obtained above. Note that in this case we’re also using the 5-year ACS (even though the 1-year ACS is also available at the statewide level, and would provide more up-to-date data) so that the statewide data will be directly comparable to the block group level data obtained above.

census_data_acs_state <- get_acs(geography = 'state',
                                 state = 'CA', 
                                 year = acs_year,
                                 survey = 'acs5',
                                 variables = census_vars_acs, 
                                 output = 'wide', # can be 'wide' or 'tidy'
                                 geometry = TRUE,
                                 cache_table = TRUE) %>% 
    st_transform(crs_projected) %>%  # convert to common coordinate system
    select(-matches('M$')) %>%  # the $ specifies "ends with"
    # clean names (note this is a little different than the way we renamed fields above, either works)
    rename_with(.fn = ~ str_remove(., # remove 'E' (estimate) from field names
                                   pattern = 'E$')) %>% 
    rename_with(.fn = ~ str_replace(., # add 'E' back to NAME field
                                    pattern = 'NAM', 
                                    replacement = 'NAME'))

5.3 Plot Census & Supplier Data

system_plot <- 'SACRAMENTO SUBURBAN WATER DISTRICT'

Figure 3 shows the 2022 5-year ACS census units that overlap with one of the water systems (Sacramento Suburban Water District) that we’ll compute demographics for below (plotting the census units that overlap all systems tends to be slow in this format).

mapview(water_systems_sac %>% 
            filter(water_system_name == system_plot), 
        zcol = 'water_system_name', 
        layer.name = 'Water System', 
        legend = FALSE) +
    mapview(census_data_acs %>% 
                st_filter(water_systems_sac %>% 
                              filter(water_system_name == system_plot)), 
            alpha.regions = 0, 
            color = 'cyan', 
            lwd = 1.3, label = 'NAME',  
            layer.name = 'ACS Data', 
            legend = FALSE) #  zcol = 'NAME'
Figure 3: Water system Sacramento Suburban Water District (filled polygon) and boundaries of census units (light blue) that will be used to estimate water system demographics.

6 Compute Water System Demographics

Now we can perform the calculations to estimate demographic characteristics for our target areas (water system service boundaries in the Sacramento County area) from our source demographic dataset (the census data we obtained above). For this example, we’ll use the 2022 5-year ACS data that we retrieved above (which is saved in the census_data_acs variable) as our source of demographic data, and we’ll estimate the following for each water system’s service area:

  • Population of each racial/ethnic group (using the racial/ethnic categories defined in the census dataset), and each racial/ethnic group’s portion of the total service area population
  • Socio-economic variables like poverty rate, median household income, income distributions, and per capita income

There are multiple ways this estimation can be done. For this example, we’ll employ a three step strategy:

  1. Estimate values for count-based variables (typically referred to as ‘extensive’ data types) – e.g., total population, popultion by race/ethnicity, population above / below poverty rate, households by income bracket – for overlapping census unit, using areal interpolation. This is essentially an area weighted average, which estimates how much of each source unit’s (census unit) count applies to the target area (a given water service area), based on the portion of its area that overlaps that target area – for more information about the process, see this documentation from the areal R package. For example, for a census unit that partially overlaps a service area, only a fraction of its count for a given variable will be applied to that service area; for a census unit that completely overlaps a service area, the full count for that variable will be applied to the service area.

    The major simplifying assumption of this approach is that the population or count-based variable of interest are evenly distributed within each unit in the source data. For example, in this case we’re assuming that population (including the total population and the population of each racial/ethic group), households of each income bracket, populations above / below the poverty rate, etc. are evenly distributed within each census block group.

Tip

While this section uses the block group-level count data from the 5-year ACS, there may be cases where it could be useful or necessary to use more granular block-level population data from the decennial census to estimate population densities and distributions within block groups. This could especially be the case when estimating characteristics for small and/or rural areas. See Section 11.2 for an approach which implements a method that does that, and Section 9 for detailed estimates of population alone.

See Section 10 for more information about challenges estimating values for small / rural areas.

  1. Using the estimated count data (populations, households, etc), compute weighted values for variables that describe those populations, using the associated count data as a weighting factor (e.g., population-weighted values for population based data, or household-weighted values for household-based data) – these variables are typically referred to as ‘intensive’ data types.
Tip

Although it’s possible to use areal interpolation to aggregate these variables as well, the multi-step approach described here can be useful because we know (from the population / household count data) that population densities differ between census units. Since we have a reasonable estimate of the count data (population, households, etc) within each census unit, using a population or household weighted average likely will yield more accurate results than a simple area-weighted average for these variables. For example, for per capita income, we can use the estimated population counts to produce a population weighted average per capita income (rather than an area weighted average per capita income, which is likely less meaningful as it over-weights large census areas with lower population densities). Areal interpolation may be more useful for cases where we generally have no other information about how density varies between the source polygons (unless significantly more effort is invested, such as looking at aerial imagery data)

  1. Aggregate interpolated values at the water system level.

6.1 Prepare Census Data

Note that we already transformed the 2022 5-year ACS dataset into the common projected coordinate reference system used for this example immediately after we downloaded the data using the get_acs() function (see Listing 3). This allows us to work with the water system data and the census data together in a common coordinate system.

Before calculating demographics for the target areas, we can do a bit of additional transformation to prepare the census data. First, because we won’t be incorporating the margin of error (MOE) into the analysis below, we can drop them for this example, then clean up the field names.

Tip

It is possible to calculate MOEs for derived estimates – e.g., when aggregating groups of census units – and in many cases it may be worthwhile to do that to provide extra context to the data. However, it may not be possible (or may be very difficult) to calculate MOEs for data estimated using more complex aggregations, such as the areal interpolation shown below – more research on that may be needed.

For guidance on how calculate MOEs for some types of derived estimates, see this document.

For an alternative, simplified approach to estimating census demographics for target areas which includes MOEs for the derived estimates, see Section 11.1.

# drop MOE fields
census_data_acs <- census_data_acs %>% 
    select(-matches('M$')) # the $ specifies "ends with"

# clean names
names(census_data_acs) <- names(census_data_acs) %>% 
    str_remove('E$') %>% # remove 'E' (estimate) from field names
    str_replace('NAM', 'NAME') # add 'E' back to NAME field

Here’s a view of the contents and structure of the revised 2022 5-year ACS dataset (only the first few fields are shown):

glimpse(census_data_acs[,1:20])
Rows: 1,054
Columns: 21
$ GEOID                                             <chr> "060670081451", "060…
$ NAME                                              <chr> "Block Group 1; Cens…
$ population_total_count                            <dbl> 1768, 1881, 1098, 27…
$ population_hispanic_or_latino_count               <dbl> 38, 327, 376, 782, 3…
$ population_white_count                            <dbl> 1627, 1337, 293, 181…
$ population_black_or_african_american_count        <dbl> 0, 1, 272, 26, 351, …
$ population_native_american_or_alaska_native_count <dbl> 41, 0, 0, 26, 0, 0, …
$ population_asian_count                            <dbl> 45, 0, 105, 58, 144,…
$ population_pacific_islander_count                 <dbl> 0, 98, 0, 0, 27, 13,…
$ population_other_count                            <dbl> 0, 0, 39, 0, 0, 0, 0…
$ population_multiple_count                         <dbl> 17, 118, 13, 39, 15,…
$ poverty_total_assessed_count                      <dbl> 1768, 1847, 1098, 27…
$ poverty_below_level_count                         <dbl> 101, 328, 272, 116, …
$ poverty_above_level_count                         <dbl> 1667, 1519, 826, 263…
$ households_count                                  <dbl> 680, 718, 405, 905, …
$ average_household_size                            <dbl> 2.59, 2.62, 2.71, 2.…
$ median_household_income                           <dbl> 123500, 66768, 56216…
$ households_income_below_10k_count                 <dbl> 18, 47, 10, 22, 6, 1…
$ households_income_10k_15k_count                   <dbl> 0, 0, 24, 0, 15, 231…
$ households_income_15k_20k_count                   <dbl> 0, 13, 18, 0, 51, 12…
$ geometry                                          <POLYGON [m]> POLYGON ((-1…

We can also do some other transformations to simplify the data. For example, we can combine the ‘other’ and ‘multiple’ racial/ethnic groupings into one ‘other or multiple’ racial/ethnic group.

## combine other and multiple
census_data_acs <- census_data_acs %>% 
    mutate('population_other_or_multiple_count' = 
               population_other_count + population_multiple_count, 
           .after = population_pacific_islander_count) %>% 
    select(-c(population_other_count, population_multiple_count))

And we can calculate the poverty rate for each census unit (which may be useful for presenting results later).

census_data_acs <- census_data_acs %>% 
    mutate(poverty_rate_pct_calc_census_unit = case_when(
        poverty_total_assessed_count == 0 ~ 0,
        .default = 100 * poverty_below_level_count / poverty_total_assessed_count
    ), 
    .after = poverty_above_level_count)

6.2 Interpolation Step 1: Areal Interpolation (for Count Variables)

There are a couple of ways to implement the areal interpolation method. The example below ‘manually’ implements the process using functions from the sf package, for reasons described below. However, note that there are R packages which make it possible to perform areal interpolation with a single function - for example, the sf package’s st_interpolate_aw function and the areal package’s aw_interpolate function. This example uses a more ‘manual’ approach because this makes it possible to use the multi-step process described above, and also produces useful intermediate calculated data for mapping and visualization. However, we can use the single-function approach to double check our implementation of the areal interpolation approach for the count data (see Section 6.4.2).

Warning

Areal interpolation may not work well in some cases (for example, in areas that are largely rural or near uninhabiated areas.) In these cases, it’s possible to use more granular block-level population data from the decennial census to estimate population densities and distributions within block groups. See Section 11.2 for an approach which implements a method for doing that.

First, we clip the census data to the water system boundaries:

census_data_clip <- census_data_acs %>% 
    mutate(census_unit_area = st_area(.)) %>% 
    st_intersection(water_systems_sac) %>% 
    mutate(clipped_area = st_area(.)) %>% 
    mutate(areal_weight_factor = drop_units(clipped_area / census_unit_area))

Figure 4 shows a plot of the census units clipped to the Sacramento Suburban Water District water system, along with the original/complete census units. Note that you can toggle layers on and off (and change their order of appearance) using the layers button in the upper left part of the map (below the zoom buttons).

mapview(water_systems_sac %>% 
            filter(water_system_name == system_plot), 
        zcol = 'water_system_name', 
        layer.name = 'Water System', 
        legend = FALSE) + 
    mapview(census_data_acs %>% 
                st_filter(water_systems_sac %>% 
                              filter(water_system_name == system_plot)), 
            alpha.regions = 0.15, 
            col.regions = 'grey', 
            color = 'black', 
            lwd = 1, 
            label = 'NAME',  
            layer.name = 'ACS Data Full', 
            legend = FALSE) +
    mapview(census_data_clip %>% 
                filter(water_system_name == system_plot),
            alpha.regions = 0, 
            color = 'cyan', 
            lwd = 1.3, 
            label = 'NAME',  
            layer.name = 'ACS Data Clipped', 
            legend = FALSE)
Figure 4: Water system Sacramento Suburban Water District (filled polygon), boundaries of overalpping census units (grey), and clipped portions of census units (light blue) that will be used to estimate water system demographics.

Next, we can compute the area-weighted counts for the portions of census units that overlap each water system boundary:

census_data_interpolate <- census_data_clip %>% 
    mutate(
        across(
            .cols = ends_with('_count'),
            .fns = ~ .x * areal_weight_factor
        )) 

As noted above, it’s also possible to use pre-built functions from several R packages to perform areal interpolation in a single step. Since we’re using a three-step process, which also implements population weighted averaging for some variables, we’re not using those functions directly in this example. However, they can be a useful check to validate our computed count data, but only after we aggregate our data at the system level – see Section 6.4.2 for more details.

6.3 Interpolation Step 2: Compute Population Weighted Values (Intensive Variables)

Compute population weighted values

census_data_interpolate <- census_data_interpolate %>% 
    mutate(average_household_size_weighted = average_household_size * households_count,
           median_household_income_weighted = median_household_income * households_count,
           per_capita_income_weighted = per_capita_income * population_total_count)
Caution

To calculate an aggregated value for a variable like median household income, which depends on the distribution of the underling data, it may be worth considering whether a weighed average value is an appropriate measure. In some cases, it may be more appropriate to use the counts in each income bracket to estimate a median income, and/or present the income distribution rather than a single value.

For a discussion of the problem and a proposed solution, see this document.

6.4 Interpolation Step 3: Aggregate by Water System

Next, we need to combine the weighted values calculated above to produce the estimates for each water system, and can also use those combined values to compute some additional metrics for each system (like rates, income distributions, etc.).

6.4.1 Combine Results by Water System

First, combine the results by summing all of the count-based variables (derived from areal interpolation), and calculating weighted averages for all variables computed in step 2 above. Note that we have to first calculate the denominator for each variable calculated with population weighted interpolation, because some of those variables contain missing values for records where the denominator is present (and if we don’t remove the missing values, we get an NA for any water system that contains a block group with a missing value for that variable). For example, there are block groups where the median household income is missing, but the total household count is available for that block group – in that case, the weighted average should not include the households in that block group in the denominator; otherwise, the true value will be underestimated.

water_system_demographics <- census_data_interpolate %>% 
    group_by(water_system_name) %>% 
    mutate(
        average_household_size_denominator = if_else(
            is.na(average_household_size), 
            0, 
            households_count),
        median_household_income_denominator = if_else(
            is.na(median_household_income), 
            0, 
            households_count),
        per_capita_income_denominator = if_else(
            is.na(per_capita_income), 
            0, 
            population_total_count)
    ) %>% 
    summarize(
        across(
            .cols = ends_with('_count'),
            .fns = ~ sum(.x)
        ),
        average_household_size_hh_weighted = 
            sum(average_household_size_weighted, na.rm = TRUE) / 
            sum(average_household_size_denominator),
        median_household_income_hh_weighted = 
            sum(median_household_income_weighted, na.rm = TRUE) /
            sum(median_household_income_denominator),
        per_capita_income_pop_weighted = 
            sum(per_capita_income_weighted, na.rm = TRUE) / 
            sum(per_capita_income_denominator)
    ) %>% 
    ungroup()

6.4.2 Check - Count Variables Estimated with Areal Interpolation

As noted above, it’s also possible to use pre-built functions for areal interpolation. This section demonstrates those functions and uses them as a check of our computed count data.

From the sf package, we can use the st_interpolate_aw function:

# NOTE: it's only necessary to check the estimated values for one variable - 
# this just checks the total estimated population

# sf package
check_sf <- st_interpolate_aw(x = census_data_acs %>% 
                                  select(population_total_count),
                              to = water_systems_sac,
                              extensive = TRUE) %>% 
    bind_cols(water_systems_sac %>% st_drop_geometry)

# check - should be TRUE if results are equivalent
all(check_sf %>% 
        arrange(water_system_name) %>% 
        pull(population_total_count) %>% 
        round(5) ==
        water_system_demographics %>% 
        arrange(water_system_name) %>% 
        pull(population_total_count) %>% 
        round(5)
)
[1] TRUE

From the areal package, we can use the aw_interpolate function. Note that there are some settings that you may need to modify in the aw_interpolate function depending on the type of analysis you’re doing. In particular, for more information about the weight argument – which can be either sum or total – see this section of the documentation. For more information about extensive versus intensive interpolations, see this section of the documenation (as noted above, the method applied here avoids using areal interpolation to calculate intensive variables, because area may not be a good metric for determining how to weight those variables, considering that we can estimate associated counts for populations/households/etc.).

# NOTE: it's only necessary to check the estimated values for one variable - 
# this just checks the total estimated population

# areal package
check_areal <- aw_interpolate(water_systems_sac,
                              tid = water_system_name,
                              source = census_data_acs,
                              sid = GEOID,
                              weight = 'total',
                              extensive = c('population_total_count'))

# check - should be TRUE if results are equivalent
all(check_areal %>% 
        arrange(water_system_name) %>% 
        pull(population_total_count) %>% 
        round(5) ==
        water_system_demographics %>% 
        arrange(water_system_name) %>% 
        pull(population_total_count) %>% 
        round(5)
)
[1] TRUE

6.4.3 Clean & Format Summarized Water System Demographic Data

We could stop here, and save the dataset containing the results to an output file (done below - see Section 6.5). But, it may be useful to do some additional computations and re-formatting before saving the dataset. For example, in this case it may be useful to calculate the racial/ethnic breakdown of each system’s population as percentages of the total population (in addition to the total counts computed above), and calculate other rates / distributions.

First we can add fields with each racial/ethnic group’s estimated percent of the total population within each water system’s service area:

water_system_demographics <- water_system_demographics %>%
    mutate(
        across(
            .cols = starts_with('population_'),
            .fns = ~ round(.x / population_total_count * 100, 2),
            .names = "{str_replace(.col, '_count', '_percent')}"
        ),
        .after = population_other_or_multiple_count) %>% 
    select(-population_total_percent) # this always equals 1, not needed

We can also calculate the estimated poverty rate for each water system’s service area.

water_system_demographics <- water_system_demographics %>% 
    mutate(poverty_rate_percent = case_when(
        poverty_total_assessed_count == 0 ~ 0,
        .default = 100 * poverty_below_level_count / poverty_total_assessed_count
    ), 
    .after = poverty_above_level_count)

And compute income brackets in 25k increments:

water_system_demographics <- water_system_demographics %>% 
    mutate(households_income_0_25k_count = 
               households_income_below_10k_count + 
               households_income_10k_15k_count + 
               households_income_15k_20k_count +
               households_income_20k_25k_count,
           households_income_25k_50k_count =
               households_income_25k_30k_count + 
               households_income_30k_35k_count +
               households_income_35k_40k_count +
               households_income_40k_45k_count +
               households_income_45k_50k_count,
           households_income_50k_75k_count =
               households_income_50k_60k_count +
               households_income_60k_75k_count,
           .after = households_income_above_200k_count
    ) # note - above 75k is already in 25k increments

And compute income brackets in 50k increments:

water_system_demographics <- water_system_demographics %>% 
    mutate(households_income_0_50k_count = 
               households_income_0_25k_count + 
               households_income_25k_50k_count,
           households_income_50k_100k_count =
               households_income_50k_75k_count +
               households_income_75k_100k_count,
           households_income_100k_150k_count =
               households_income_100k_125k_count +
               households_income_125k_150k_count,
           .after = households_income_50k_75k_count
    ) # above 150k is already in 50k increments

And compute grouped median household income:

And compute # and % of households below income thresholds:

And, compute the portion of households paying more than 30% / 50% of their income toward housing costs:

# portion of households paying more than 30% / 50% of income on housing
water_system_demographics <- water_system_demographics %>%
    mutate(households_all_housing_costs_over30pct_percent = 
               100 * (households_mortgage_housing_costs_over30pct_count + 
                          households_no_mortgage_housing_costs_over30pct_count +
                          households_rent_housing_costs_over30pct_count) / 
               households_count, 
           .after = households_rent_housing_costs_over50pct_count) %>% 
    mutate(households_all_housing_costs_over50pct_percent = 
               100 * (households_mortgage_housing_costs_over50pct_count + 
                          households_no_mortgage_housing_costs_over50pct_count +
                          households_rent_housing_costs_over50pct_count) / 
               households_count,
           .after = households_all_housing_costs_over30pct_percent)

Finally, we can round the estimated values to appropriate levels of precision:

water_system_demographics <- water_system_demographics %>%
    mutate(
        across(
            .cols = ends_with('_count'),
            .fns = ~ round(.x, 0)
        ))  %>%
    mutate(
        across(
            .cols = ends_with('_percent'),
            .fns = ~ round(.x, 2)
        ))

We now have a dataset with the selected metrics from the census data (source data) estimated for each of the water system service areas (target geographic features). Here’s a view of the contents and structure of the re-formatted dataset (only the first few fields are shown):

glimpse(water_system_demographics[,1:20])
Rows: 62
Columns: 21
$ water_system_name                                   <chr> "B & W RESORT MARI…
$ population_total_count                              <dbl> 0, 22603, 33120, 1…
$ population_hispanic_or_latino_count                 <dbl> 0, 10939, 5245, 34…
$ population_white_count                              <dbl> 0, 3504, 19456, 23…
$ population_black_or_african_american_count          <dbl> 0, 2663, 3199, 197…
$ population_native_american_or_alaska_native_count   <dbl> 0, 121, 113, 70, 0…
$ population_asian_count                              <dbl> 0, 4075, 2947, 108…
$ population_pacific_islander_count                   <dbl> 0, 240, 77, 59, 0,…
$ population_other_or_multiple_count                  <dbl> 0, 1060, 2082, 110…
$ population_hispanic_or_latino_percent               <dbl> 41.43, 48.40, 15.8…
$ population_white_percent                            <dbl> 52.47, 15.50, 58.7…
$ population_black_or_african_american_percent        <dbl> 0.00, 11.78, 9.66,…
$ population_native_american_or_alaska_native_percent <dbl> 0.00, 0.54, 0.34, …
$ population_asian_percent                            <dbl> 4.55, 18.03, 8.90,…
$ population_pacific_islander_percent                 <dbl> 0.00, 1.06, 0.23, …
$ population_other_or_multiple_percent                <dbl> 1.56, 4.69, 6.29, …
$ poverty_total_assessed_count                        <dbl> 0, 22556, 33034, 1…
$ poverty_below_level_count                           <dbl> 0, 6010, 3389, 313…
$ poverty_above_level_count                           <dbl> 0, 16546, 29645, 6…
$ poverty_rate_percent                                <dbl> 22.60, 26.64, 10.2…
$ geometry                                            <POLYGON [m]> POLYGON ((…

Table 1 provides a complete view of the cleaned and re-formatted dataset. These results are saved locally in tabular and spatial format in Section 6.5.

pct_format <- label_percent(accuracy = 0.01)

water_system_demographics %>%
    st_drop_geometry() %>% 
    mutate(across(
        .cols = ends_with('_percent'),
        .fns = ~ pct_format(. / 100))
    ) %>%
    rename_with(.cols = everything(), 
                .fn = ~ str_replace_all(., pattern = '_', replacement = ' ') %>% 
                    str_to_title(.)) %>% 
    kable(align = 'c', 
          format.args = list(big.mark = ',')
    ) %>%
    scroll_box(height = "400px")
Table 1: Estimated Water System Demographics
Water System Name Population Total Count Population Hispanic Or Latino Count Population White Count Population Black Or African American Count Population Native American Or Alaska Native Count Population Asian Count Population Pacific Islander Count Population Other Or Multiple Count Population Hispanic Or Latino Percent Population White Percent Population Black Or African American Percent Population Native American Or Alaska Native Percent Population Asian Percent Population Pacific Islander Percent Population Other Or Multiple Percent Poverty Total Assessed Count Poverty Below Level Count Poverty Above Level Count Poverty Rate Percent Households Count Households Income Below 10k Count Households Income 10k 15k Count Households Income 15k 20k Count Households Income 20k 25k Count Households Income 25k 30k Count Households Income 30k 35k Count Households Income 35k 40k Count Households Income 40k 45k Count Households Income 45k 50k Count Households Income 50k 60k Count Households Income 60k 75k Count Households Income 75k 100k Count Households Income 100k 125k Count Households Income 125k 150k Count Households Income 150k 200k Count Households Income Above 200k Count Households Income 0 25k Count Households Income 25k 50k Count Households Income 50k 75k Count Households Income 0 50k Count Households Income 50k 100k Count Households Income 100k 150k Count Households Mortgage Total Count Households Mortgage Housing Costs Over30pct Count Households Mortgage Housing Costs Over50pct Count Households No Mortgage Total Count Households No Mortgage Housing Costs Over30pct Count Households No Mortgage Housing Costs Over50pct Count Households Rent Total Count Households Rent Housing Costs Over30pct Count Households Rent Housing Costs Over50pct Count Households All Housing Costs Over30pct Percent Households All Housing Costs Over50pct Percent Average Household Size Hh Weighted Median Household Income Hh Weighted Per Capita Income Pop Weighted
B & W RESORT MARINA 0 0 0 0 0 0 0 0 41.43% 52.47% 0.00% 0.00% 4.55% 0.00% 1.56% 0 0 0 22.60% 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 40.53% 21.84% 2.030000 51,977.00 40,522.00
CAL AM FRUITRIDGE VISTA 22,603 10,939 3,504 2,663 121 4,075 240 1,060 48.40% 15.50% 11.78% 0.54% 18.03% 1.06% 4.69% 22,556 6,010 16,546 26.64% 6,900 354 339 521 263 367 302 359 355 565 692 876 784 459 235 287 141 1,477 1,948 1,569 3,425 2,352 694 1,620 745 345 1,236 95 58 4,044 2,131 1,059 43.06% 21.18% 3.257806 53,040.44 20,519.57
CALAM - ANTELOPE 33,120 5,245 19,456 3,199 113 2,947 77 2,082 15.84% 58.74% 9.66% 0.34% 8.90% 0.23% 6.29% 33,034 3,389 29,645 10.26% 10,529 315 184 101 122 116 469 248 368 449 737 1,077 1,669 1,501 1,077 1,158 937 723 1,650 1,814 2,373 3,483 2,578 5,544 1,861 621 1,747 184 106 3,238 1,678 649 35.36% 13.07% 3.134530 93,741.55 34,660.44
CALAM - ARDEN 10,112 3,433 2,392 1,977 70 1,082 59 1,100 33.95% 23.65% 19.55% 0.69% 10.70% 0.58% 10.87% 10,034 3,130 6,904 31.19% 3,823 201 259 239 167 319 190 142 236 207 440 394 535 228 148 62 58 866 1,093 834 1,959 1,368 376 265 84 46 133 8 3 3,426 2,124 1,170 57.97% 31.87% 2.623643 49,624.62 22,770.82
CALAM - ISLETON 34 14 17 0 0 2 0 1 42.06% 51.14% 0.00% 0.00% 4.55% 0.00% 2.25% 34 7 27 20.89% 16 1 1 0 1 1 0 1 1 0 2 1 1 3 1 0 1 4 3 3 6 4 4 6 4 1 7 2 2 4 1 1 39.45% 20.68% 2.078994 57,361.76 40,672.21
CALAM - LINCOLN OAKS 42,916 9,056 26,529 1,486 143 2,706 288 2,708 21.10% 61.82% 3.46% 0.33% 6.31% 0.67% 6.31% 42,823 4,074 38,749 9.51% 15,621 740 375 308 622 488 616 585 629 645 1,035 1,641 2,442 1,889 1,272 1,555 778 2,046 2,964 2,675 5,010 5,118 3,161 7,390 2,671 919 3,332 503 298 4,900 2,523 1,302 36.46% 16.13% 2.730281 82,035.52 33,728.94
CALAM - PARKWAY 58,635 18,665 8,921 6,965 21 19,228 1,386 3,449 31.83% 15.21% 11.88% 0.04% 32.79% 2.36% 5.88% 58,434 9,804 48,630 16.78% 17,667 1,081 753 514 713 694 640 713 700 727 1,145 1,918 2,490 1,634 1,532 1,546 865 3,061 3,475 3,064 6,536 5,554 3,166 7,163 2,719 1,049 3,418 647 383 7,086 3,517 1,917 38.96% 18.96% 3.284608 72,938.51 26,938.14
CALAM - SUBURBAN ROSEMONT 57,897 13,791 25,062 7,725 91 6,905 380 3,942 23.82% 43.29% 13.34% 0.16% 11.93% 0.66% 6.81% 57,661 8,374 49,287 14.52% 21,045 1,156 612 472 744 653 568 582 874 628 1,289 2,508 3,438 2,595 1,594 1,671 1,661 2,985 3,305 3,797 6,290 7,235 4,189 8,262 2,262 730 3,425 439 271 9,358 4,521 2,320 34.31% 15.78% 2.726937 81,229.87 34,497.37
CALAM - WALNUT GROVE 12 5 5 0 0 1 0 0 44.60% 45.84% 0.00% 0.00% 5.93% 0.00% 3.63% 12 2 10 15.75% 5 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 1 1 1 2 2 2 0 2 0 0 1 0 0 2 1 0 24.49% 14.65% 2.490000 68,248.00 38,950.00
CALIFORNIA STATE FAIR 532 78 262 91 0 48 0 52 14.68% 49.25% 17.13% 0.00% 9.10% 0.00% 9.85% 526 152 374 28.89% 285 65 13 8 5 9 14 2 0 23 29 30 35 21 11 17 3 91 48 59 140 93 32 0 0 0 0 0 0 285 177 95 62.11% 33.45% 1.820000 52,886.00 33,141.00
CARMICHAEL WATER DISTRICT 39,253 6,192 25,026 2,230 68 3,326 295 2,116 15.78% 63.76% 5.68% 0.17% 8.47% 0.75% 5.39% 38,700 5,000 33,700 12.92% 15,937 570 534 513 472 398 607 522 684 541 996 1,595 1,782 1,724 1,200 1,678 2,122 2,088 2,751 2,591 4,839 4,373 2,924 5,256 1,399 669 3,147 358 177 7,534 4,056 2,068 36.48% 18.29% 2.405914 96,967.64 46,901.80
CITRUS HEIGHTS WATER DISTRICT 68,912 12,380 48,148 2,092 162 2,875 71 3,186 17.96% 69.87% 3.04% 0.23% 4.17% 0.10% 4.62% 68,581 6,961 61,620 10.15% 25,633 1,012 569 446 769 665 867 841 723 1,165 1,875 3,057 3,954 2,744 2,332 2,533 2,080 2,796 4,261 4,932 7,057 8,886 5,075 10,344 3,553 1,380 4,293 554 286 10,996 5,759 2,620 38.49% 16.72% 2.653808 82,960.78 37,323.17
CITY OF SACRAMENTO MAIN 516,189 151,211 159,508 62,060 1,249 98,585 9,242 34,334 29.29% 30.90% 12.02% 0.24% 19.10% 1.79% 6.65% 508,800 77,003 431,797 15.13% 194,000 9,540 9,401 6,217 6,407 5,804 6,255 6,278 6,139 6,729 13,349 17,396 26,982 20,453 15,080 17,439 20,531 31,564 31,205 30,745 62,769 57,728 35,533 67,435 21,769 8,217 29,857 3,476 1,805 96,708 47,510 24,524 37.50% 17.81% 2.609594 84,694.02 39,105.61
DEL PASO MANOR COUNTY WATER DI 5,592 687 3,967 390 15 119 31 382 12.28% 70.95% 6.97% 0.26% 2.13% 0.56% 6.84% 5,592 621 4,971 11.10% 2,222 170 45 54 66 21 51 66 237 40 158 278 166 171 120 347 231 336 416 436 752 601 291 922 326 189 572 112 68 729 509 114 42.59% 16.67% 2.516895 90,374.38 40,254.83
DELTA CROSSING MHP 0 0 0 0 0 0 0 0 69.19% 28.71% 0.00% 0.00% 0.00% 0.00% 2.10% 0 0 0 17.42% 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 45.66% 25.57% 2.550000 56,250.00 23,510.00
EAST WALNUT GROVE [SWS] 3 2 2 0 0 0 0 0 44.60% 45.84% 0.00% 0.00% 5.93% 0.00% 3.63% 3 1 3 15.75% 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 24.49% 14.65% 2.490000 68,248.00 38,950.00
EDGEWATER MOBILE HOME PARK 0 0 0 0 0 0 0 0 3.90% 89.23% 3.23% 0.00% 0.00% 0.00% 3.63% 0 0 0 35.94% 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 28.02% 23.19% 1.790000 38,125.00 33,103.00
EL DORADO MOBILE HOME PARK 139 84 11 15 0 19 0 11 60.26% 7.80% 10.48% 0.00% 13.27% 0.00% 8.19% 139 60 79 43.12% 48 6 10 0 4 6 1 0 8 1 7 0 1 0 4 0 1 19 15 8 34 9 4 3 0 0 10 5 5 35 17 10 46.70% 31.09% 2.710000 29,468.00 17,394.00
EL DORADO WEST MHP 148 89 12 16 0 20 0 12 60.26% 7.80% 10.48% 0.00% 13.27% 0.00% 8.19% 147 63 84 43.12% 51 6 10 0 4 6 1 0 8 2 8 0 1 0 5 0 1 20 16 8 37 9 5 3 0 0 10 6 6 38 18 10 46.70% 31.09% 2.710000 29,468.00 17,394.00
ELEVEN OAKS MOBILE HOME COMMUNITY 233 45 94 56 0 37 0 1 19.27% 40.19% 24.01% 0.00% 15.91% 0.00% 0.62% 233 87 146 37.48% 71 7 2 3 6 10 2 1 1 3 1 13 17 3 0 3 0 17 17 15 34 32 3 8 3 1 21 1 1 42 29 23 46.85% 35.36% 3.280000 60,521.00 18,213.00
ELK GROVE WATER SERVICE 42,647 7,656 19,550 3,209 70 8,939 388 2,835 17.95% 45.84% 7.53% 0.16% 20.96% 0.91% 6.65% 42,258 3,264 38,994 7.72% 13,239 430 202 253 224 328 102 345 292 245 667 1,117 1,441 1,470 1,386 1,907 2,832 1,108 1,311 1,784 2,420 3,225 2,856 7,552 1,903 628 2,861 283 113 2,826 1,595 864 28.55% 12.12% 3.179068 122,771.00 43,429.03
FAIR OAKS WATER DISTRICT 36,003 4,655 27,050 708 94 1,372 12 2,113 12.93% 75.13% 1.97% 0.26% 3.81% 0.03% 5.87% 35,775 2,852 32,923 7.97% 14,233 546 332 113 229 208 391 206 469 293 804 1,064 2,214 1,447 1,568 1,875 2,474 1,220 1,568 1,868 2,788 4,082 3,016 7,090 1,872 845 3,092 261 108 4,051 1,844 768 27.94% 12.09% 2.480217 107,985.74 54,435.01
FLORIN COUNTY WATER DISTRICT 9,951 2,963 1,548 1,394 7 2,743 866 430 29.78% 15.56% 14.01% 0.07% 27.56% 8.70% 4.32% 9,835 1,285 8,550 13.06% 2,755 84 125 53 154 103 46 86 176 224 258 223 432 297 215 143 137 417 635 481 1,051 913 512 981 426 90 675 49 28 1,100 476 260 34.48% 13.70% 3.573005 67,048.12 24,517.64
FOLSOM STATE PRISON 3,536 1,257 652 1,390 57 70 34 77 35.55% 18.43% 39.31% 1.60% 1.97% 0.96% 2.17% 29 1 28 2.20% 23 0 0 0 0 0 0 0 0 0 0 0 0 4 4 12 1 0 0 0 0 1 8 3 1 0 0 0 0 19 0 0 4.67% 0.53% 2.726311 161,047.22 2,271.22
FOLSOM, CITY OF - ASHLAND 3,845 318 2,934 43 1 125 1 423 8.26% 76.32% 1.12% 0.03% 3.26% 0.02% 10.99% 3,780 143 3,637 3.79% 1,800 44 17 104 43 34 209 103 74 43 43 158 248 132 80 123 345 208 463 201 670 449 212 594 164 90 847 368 82 358 196 74 40.42% 13.70% 2.087286 76,810.17 56,773.97
FOLSOM, CITY OF - MAIN 62,462 8,433 35,222 1,693 105 12,934 177 3,897 13.50% 56.39% 2.71% 0.17% 20.71% 0.28% 6.24% 62,115 3,405 58,710 5.48% 22,409 807 218 390 477 418 283 329 373 451 670 1,181 2,255 2,382 1,747 4,083 6,344 1,892 1,855 1,851 3,747 4,106 4,129 11,491 2,728 1,179 3,590 237 146 7,328 3,010 1,321 26.66% 11.81% 2.769356 141,856.37 58,469.35
FREEPORT MARINA 3 2 1 0 0 0 0 0 69.19% 28.71% 0.00% 0.00% 0.00% 0.00% 2.10% 3 1 3 17.42% 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 45.66% 25.57% 2.550000 56,250.00 23,510.00
GALT, CITY OF 21,490 9,314 9,952 520 22 872 20 789 43.34% 46.31% 2.42% 0.10% 4.06% 0.09% 3.67% 21,341 1,404 19,937 6.58% 6,988 139 168 243 210 141 342 161 347 152 550 687 807 1,096 504 789 650 761 1,143 1,237 1,904 2,044 1,601 3,724 907 523 1,454 109 44 1,809 906 414 27.52% 14.05% 3.048249 90,632.93 33,685.54
GOLDEN STATE WATER CO - ARDEN WATER SERV 6,556 1,706 2,887 322 0 888 11 742 26.02% 44.04% 4.91% 0.00% 13.54% 0.16% 11.32% 6,453 1,626 4,828 25.19% 2,173 19 82 19 141 53 173 34 179 37 139 351 319 132 172 141 183 262 476 490 738 809 303 728 239 123 131 0 0 1,315 599 335 38.56% 21.09% 2.897716 66,579.36 30,417.36
GOLDEN STATE WATER CO. - CORDOVA 48,115 9,009 26,042 3,982 229 6,050 188 2,615 18.72% 54.13% 8.28% 0.48% 12.57% 0.39% 5.43% 47,835 4,408 43,427 9.21% 18,022 509 482 310 496 480 437 389 469 598 1,276 1,692 2,653 2,565 1,671 1,948 2,047 1,796 2,374 2,968 4,170 5,621 4,236 7,380 2,174 836 3,506 364 201 7,137 2,744 1,410 29.31% 13.58% 2.650717 96,697.06 42,695.41
HAPPY HARBOR (SWS) 0 0 0 0 0 0 0 0 3.90% 89.23% 3.23% 0.00% 0.00% 0.00% 3.63% 0 0 0 35.94% 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 28.02% 23.19% 1.790000 38,125.00 33,103.00
HOLIDAY MOBILE VILLAGE 46 18 7 3 0 15 0 3 38.66% 15.12% 7.10% 0.00% 32.49% 0.00% 6.64% 46 10 36 22.33% 16 2 1 0 1 0 1 5 1 0 0 2 2 1 0 0 0 4 7 2 11 4 1 2 0 0 2 1 1 12 6 4 44.88% 28.55% 2.860000 38,491.00 16,707.00
HOOD WATER MAINTENCE DIST [SWS] 1 1 0 0 0 0 0 0 69.19% 28.71% 0.00% 0.00% 0.00% 0.00% 2.10% 1 0 1 17.42% 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 45.66% 25.57% 2.550000 56,250.00 23,510.00
IMPERIAL MANOR MOBILEHOME COMMUNITY 209 52 129 1 0 6 0 21 24.93% 61.63% 0.45% 0.00% 2.93% 0.00% 10.05% 209 45 164 21.48% 124 4 26 18 3 0 16 7 5 6 1 4 29 0 0 0 6 51 34 5 84 34 0 9 0 0 89 37 34 27 27 22 50.97% 45.07% 1.680363 31,831.84 32,878.17
KORTHS PIRATES LAIR 0 0 0 0 0 0 0 0 3.90% 89.23% 3.23% 0.00% 0.00% 0.00% 3.63% 0 0 0 35.94% 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 28.02% 23.19% 1.790000 38,125.00 33,103.00
LAGUNA DEL SOL INC 24 5 18 0 0 0 0 0 21.55% 75.20% 0.00% 0.67% 1.46% 0.00% 1.12% 24 2 22 6.40% 9 0 1 1 0 0 0 0 0 0 0 0 2 0 0 1 2 2 1 0 3 2 1 5 2 2 3 0 0 2 0 0 23.37% 23.37% 2.640000 95,227.00 50,793.00
LAGUNA VILLAGE RV PARK 20 3 2 1 0 11 2 2 12.79% 8.48% 7.28% 0.00% 52.62% 8.38% 10.45% 20 2 18 11.79% 7 1 0 0 0 0 0 0 0 0 0 1 1 0 1 1 1 1 1 1 2 2 1 3 1 0 1 0 0 3 1 0 32.52% 12.26% 3.030000 84,332.00 32,668.00
LINCOLN CHAN-HOME RANCH 4 2 2 0 0 0 0 0 44.60% 45.84% 0.00% 0.00% 5.93% 0.00% 3.63% 4 1 3 15.75% 2 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 0 1 0 0 0 0 0 1 0 0 24.49% 14.65% 2.490000 68,248.00 38,950.00
LOCKE WATER WORKS CO [SWS] 1 0 0 0 0 0 0 0 44.60% 45.84% 0.00% 0.00% 5.93% 0.00% 3.63% 1 0 1 15.75% 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 24.49% 14.65% 2.490000 68,248.00 38,950.00
MAGNOLIA MUTUAL WATER 1 0 0 0 0 0 0 0 44.60% 45.84% 0.00% 0.00% 5.93% 0.00% 3.63% 1 0 1 15.75% 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 24.49% 14.65% 2.490000 68,248.00 38,950.00
MC CLELLAN MHP 269 52 108 65 0 43 0 2 19.27% 40.19% 24.01% 0.00% 15.91% 0.00% 0.62% 269 101 168 37.48% 82 8 2 3 7 11 2 2 1 3 1 15 20 3 0 3 0 20 19 17 39 36 3 9 4 2 25 1 1 48 34 27 46.85% 35.36% 3.280000 60,521.00 18,213.00
OLYMPIA MOBILODGE 290 70 81 18 0 101 16 3 24.12% 28.03% 6.30% 0.00% 34.95% 5.53% 1.08% 290 68 222 23.43% 114 11 0 6 10 9 3 13 0 0 10 19 8 3 12 5 5 28 25 29 53 36 14 31 22 10 51 12 10 33 9 7 37.35% 23.74% 2.510000 53,786.00 29,451.00
ORANGE VALE WATER COMPANY 17,387 2,658 12,308 241 181 633 86 1,281 15.28% 70.79% 1.39% 1.04% 3.64% 0.49% 7.37% 17,288 1,904 15,384 11.01% 6,595 389 111 61 94 226 58 274 120 181 372 752 990 901 626 678 766 655 858 1,123 1,512 2,113 1,526 3,246 1,021 453 1,686 315 185 1,663 693 305 30.77% 14.29% 2.608348 92,693.71 42,509.89
PLANTATION MOBILE HOME PARK 10 4 1 1 0 3 0 1 38.66% 15.12% 7.10% 0.00% 32.49% 0.00% 6.64% 10 2 7 22.33% 3 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 2 1 0 1 0 0 0 0 0 2 1 1 44.88% 28.55% 2.860000 38,491.00 16,707.00
RANCHO MARINA 0 0 0 0 0 0 0 0 3.90% 89.23% 3.23% 0.00% 0.00% 0.00% 3.63% 0 0 0 35.94% 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 28.02% 23.19% 1.790000 38,125.00 33,103.00
RANCHO MURIETA COMMUNITY SERVI 3,239 661 2,157 120 7 188 0 106 20.42% 66.59% 3.71% 0.21% 5.80% 0.00% 3.26% 3,239 199 3,040 6.13% 1,402 59 42 0 6 5 18 74 27 75 44 81 88 118 204 241 319 108 199 125 307 213 323 1,029 205 103 270 63 57 103 41 40 22.02% 14.30% 2.307704 144,993.81 66,451.34
RIO COSUMNES CORRECTIONAL CENTER [SWS] 22 6 8 4 1 1 0 2 25.74% 37.49% 16.82% 2.97% 4.50% 1.81% 10.66% 4 0 4 0.00% 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 23.75% 0.00% 3.450000 115,897.00 11,095.00
RIO LINDA/ELVERTA COMMUNITY WATER DIST 11,831 2,585 7,595 337 17 765 21 512 21.85% 64.19% 2.85% 0.14% 6.46% 0.18% 4.33% 11,829 1,619 10,210 13.69% 3,762 177 156 67 169 56 113 116 114 118 173 297 607 492 431 416 259 569 518 470 1,087 1,077 922 1,918 573 157 773 114 47 1,070 519 340 32.07% 14.49% 3.123012 83,603.04 33,734.49
RIVER'S EDGE MARINA & RESORT 0 0 0 0 0 0 0 0 3.90% 89.23% 3.23% 0.00% 0.00% 0.00% 3.63% 0 0 0 35.94% 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 28.02% 23.19% 1.790000 38,125.00 33,103.00
SAC CITY MOBILE HOME COMMUNITY LP 229 82 17 7 0 123 0 0 35.66% 7.50% 3.27% 0.00% 53.57% 0.00% 0.00% 229 110 119 48.14% 89 11 16 9 10 8 0 0 4 2 7 1 13 4 4 0 0 46 14 8 60 21 8 4 2 2 15 2 0 71 41 30 48.95% 35.43% 2.530000 22,380.00 16,689.00
SACRAMENTO SUBURBAN WATER DISTRICT 193,126 43,047 97,872 17,684 834 20,602 624 12,464 22.29% 50.68% 9.16% 0.43% 10.67% 0.32% 6.45% 190,984 33,399 157,585 17.49% 72,505 3,817 3,001 3,069 2,884 3,205 3,100 3,337 2,893 2,342 5,541 6,792 10,037 6,480 4,342 5,488 6,177 12,771 14,878 12,333 27,649 22,370 10,822 23,467 7,204 2,837 12,037 2,087 1,160 37,001 21,072 10,274 41.88% 19.68% 2.635471 73,746.51 35,321.18
SAN JUAN WATER DISTRICT 30,122 3,409 21,349 831 287 2,762 17 1,467 11.32% 70.87% 2.76% 0.95% 9.17% 0.06% 4.87% 30,014 1,718 28,297 5.72% 10,750 389 168 100 275 128 160 111 133 127 472 684 984 854 876 1,032 4,256 932 658 1,156 1,591 2,141 1,730 6,210 1,754 724 2,883 528 357 1,658 726 339 27.98% 13.21% 2.783858 160,696.10 72,978.42
SCWA - ARDEN PARK VISTA 8,086 990 6,016 270 12 396 8 395 12.24% 74.40% 3.33% 0.15% 4.90% 0.10% 4.88% 8,038 523 7,515 6.51% 3,303 79 36 48 77 65 38 18 49 162 139 187 253 465 208 416 1,065 241 330 326 571 579 673 1,823 520 112 673 76 23 807 384 225 29.69% 10.90% 2.424845 139,081.65 84,548.46
SCWA - LAGUNA/VINEYARD 145,495 27,502 38,496 16,568 246 50,411 2,220 10,052 18.90% 26.46% 11.39% 0.17% 34.65% 1.53% 6.91% 145,198 14,710 130,489 10.13% 45,137 1,692 666 742 878 839 1,336 850 788 752 2,363 3,198 6,037 5,323 5,057 6,578 8,038 3,978 4,565 5,561 8,543 11,598 10,380 24,581 7,232 2,916 7,878 861 471 12,677 6,368 3,337 32.04% 14.90% 3.207447 114,494.03 41,415.71
SCWA MATHER-SUNRISE 18,249 2,708 8,114 1,553 23 4,507 164 1,180 14.84% 44.47% 8.51% 0.12% 24.70% 0.90% 6.47% 18,211 1,005 17,206 5.52% 5,503 228 35 97 57 68 39 12 20 36 189 320 533 645 755 1,003 1,469 416 174 509 590 1,042 1,399 3,756 881 266 855 60 43 893 318 167 22.89% 8.66% 3.296327 147,818.01 47,448.37
SEQUOIA WATER ASSOC 0 0 0 0 0 0 0 0 44.60% 45.84% 0.00% 0.00% 5.93% 0.00% 3.63% 0 0 0 15.75% 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 24.49% 14.65% 2.490000 68,248.00 38,950.00
SOUTHWEST TRACT W M D [SWS] 174 29 42 24 3 75 1 0 16.58% 24.48% 13.69% 1.55% 43.11% 0.60% 0.00% 174 38 136 21.83% 57 1 2 7 0 7 0 0 10 12 3 2 5 0 1 2 4 10 29 6 39 10 1 3 1 0 8 0 0 45 29 7 52.53% 12.40% 3.040000 45,671.00 36,348.00
SPINDRIFT MARINA 0 0 0 0 0 0 0 0 3.90% 89.23% 3.23% 0.00% 0.00% 0.00% 3.63% 0 0 0 35.94% 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 28.02% 23.19% 1.790000 38,125.00 33,103.00
TOKAY PARK WATER CO 652 214 134 37 0 239 0 28 32.80% 20.55% 5.61% 0.00% 36.69% 0.00% 4.35% 652 113 539 17.29% 173 2 2 3 21 0 0 13 13 10 18 27 36 14 4 10 0 27 36 45 64 81 18 81 38 11 44 0 0 48 32 12 40.57% 13.58% 3.757973 62,802.24 19,400.05
TUNNEL TRAILER PARK 0 0 0 0 0 0 0 0 49.74% 34.94% 0.00% 0.00% 4.65% 0.00% 10.67% 0 0 0 0.00% 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 20.30% 0.00% 2.950000 153,092.00 42,507.00
VIEIRA'S RESORT, INC 4 2 2 0 0 0 0 0 41.43% 52.47% 0.00% 0.00% 4.55% 0.00% 1.56% 4 1 3 22.60% 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 1 0 0 0 0 0 40.53% 21.84% 2.030000 51,977.00 40,522.00
WESTERNER MOBILE HOME PARK 32 6 6 9 0 10 0 1 17.59% 17.62% 28.31% 0.55% 31.36% 0.00% 4.57% 31 7 24 23.76% 10 1 0 0 0 1 0 0 1 0 2 1 1 2 0 1 0 1 2 3 3 4 2 4 2 1 1 0 0 5 3 2 56.87% 29.49% 3.160000 59,296.00 23,437.00


6.4.4 Clean & Format Intermediate (Clipped) Data

For visualization and exploration, it’ll also be useful to apply some additional formatting to the clipped block-group data used in intermediate parts of the interpolation process above.

# portion of households paying more than 30% / 50% of income on housing
census_data_interpolate <- census_data_interpolate %>%      
    mutate(households_all_housing_costs_over30pct_percent = 
               100 * (households_mortgage_housing_costs_over30pct_count + 
                          households_no_mortgage_housing_costs_over30pct_count +                           
                          households_rent_housing_costs_over30pct_count) / 
               households_count, 
           .after = households_rent_housing_costs_over50pct_count) %>%      
    mutate(households_all_housing_costs_over50pct_percent = 
               100 * (households_mortgage_housing_costs_over50pct_count +                            
                          households_no_mortgage_housing_costs_over50pct_count +                           
                          households_rent_housing_costs_over50pct_count) / 
               households_count,
           .after = households_all_housing_costs_over30pct_percent)

# drop water system data except name (water_system_name)
census_data_interpolate <- census_data_interpolate %>% 
    select(-census_unit_area, -clipped_area) %>% 
    select(-c(water_system_number:water_system_population_reported)) %>% 
    select(-c(average_household_size_weighted:per_capita_income_weighted)) %>% 
    relocate(water_system_name, .after = NAME) %>% 
    relocate(areal_weight_factor, .after = water_system_name)

6.4.5 Transform Results to Long Format

For further analysis and exploration / visualization of the results, it will help to convert the results from wide to long format, and edit the group names so that they can be used as titles.

# pivot from wide to long format
water_system_demographics_long <- water_system_demographics %>% 
    # convert to long format
    # st_drop_geometry() %>% 
    pivot_longer(cols = !c(water_system_name, geometry), 
                 names_to = 'variable', 
                 values_to = 'value') %>% 
    relocate(geometry, .after = last_col())

# clean variable names and add grouping fields (type, group_type)
water_system_demographics_long <- water_system_demographics_long %>% 
    mutate(variable = variable %>% 
               # str_remove_all(pattern = 'percent_') %>% 
               str_replace_all(pattern = '_', replacement = ' ') %>% 
               str_replace_all(pattern = ' or ', replacement = ' / ') %>% 
               str_to_title(.) %>%
               str_remove_all(pattern = ' / Alaska Native')) %>% 
    mutate(variable_type = case_when(
        str_detect(variable, pattern = 'Count') ~ 'Count',
        str_detect(variable, pattern = 'Percent') ~ 'Percent',
        str_detect(variable, pattern = 'Pop Weighted') ~ 'Pop Weighted',
        str_detect(variable, pattern = 'Hh Weighted') ~ 'Hh Weighted',
        .default = NA), 
        .after = variable) %>% 
    mutate(variable_group_type = case_when(
        str_detect(variable, pattern ='Population') ~ 
            'Population',
        str_detect(variable, pattern = 'Households') ~ 
            'Households',
        str_detect(variable, pattern = 'Average Household Size Hh Weighted') ~ 
            'Household Weighted', 
        str_detect(variable, pattern = 'Median Household Income Hh Weighted') ~ 
            'Household Weighted',
        str_detect(variable, pattern = 'Per Capita Income Pop Weighted') ~ 
            'Population Weighted',
        str_detect(variable, pattern = 'Poverty') ~ 
            'Population'),
        .after = variable_type) %>% 
    mutate(variable = case_when(
        str_detect(variable, pattern = 'Households Count') ~ 
            'Households Total',
        .default = str_remove_all(variable, pattern = 'Households'))) %>% 
    mutate(variable = case_when(
        str_detect(variable, 'Population Total Count') ~ 
            'Population Total',
        .default = str_remove_all(variable, 'Population'))) %>%
    mutate(variable = str_remove_all(variable, 
                                     pattern = 'Count')) %>% 
    mutate(variable = str_remove_all(variable, 
                                     pattern = 'Percent')) %>% 
    mutate(variable = str_remove_all(variable, 
                                     pattern = ' Hh Weighted')) %>% 
    mutate(variable = str_remove_all(variable, 
                                     pattern = ' Pop Weighted')) %>% 
    mutate(variable = str_replace_all(variable, 
                                      pattern = 'Over30pct', 
                                      replacement = 'Over 30% Income')) %>% 
    mutate(variable = str_replace_all(variable, 
                                      pattern = 'Over50pct', 
                                      replacement = 'Over 50% Income')) %>% 
    mutate(variable = str_trim(variable)) %>%
    mutate(variable = str_replace_all(variable,
                                      pattern = 'k ',
                                      replacement = 'k-')) %>%
    mutate(variable = str_replace_all(variable,
                                      pattern = '0 ',
                                      replacement = '0-')) %>% 
    mutate(variable = str_replace_all(variable,
                                      pattern = 'Black-',
                                      replacement = 'Black ')) %>% 
    mutate(variable = str_replace_all(variable,
                                      pattern = 'Mortgage ',
                                      replacement = 'Mortgage - ')) %>%
    mutate(variable = str_replace_all(variable,
                                      pattern = 'Rent ',
                                      replacement = 'Rent - ')) %>% 
    mutate(variable = str_replace_all(variable,
                                      pattern = 'All ',
                                      replacement = 'All Households - ')) %>% 
    mutate(variable = str_replace_all(variable,
                                      pattern = 'Households Total',
                                      replacement = 'Total Households')) %>% 
    mutate(variable = str_replace_all(variable,
                                      pattern = 'Population Total',
                                      replacement = 'Total Population')) %>%
    mutate(variable = str_replace_all(variable,
                                      pattern = 'Poverty ',
                                      replacement = 'Poverty - ')) %>%
    mutate(variable = str_replace_all(variable,
                                      pattern = 'Poverty - Rate',
                                      replacement = 'Poverty Rate'))

Here’s a view of the structure of the reformatted data:

glimpse(water_system_demographics_long)
Rows: 3,472
Columns: 6
$ water_system_name   <chr> "B & W RESORT MARINA", "B & W RESORT MARINA", "B &…
$ variable            <chr> "Total Population", "Hispanic / Latino", "White", …
$ variable_type       <chr> "Count", "Count", "Count", "Count", "Count", "Coun…
$ variable_group_type <chr> "Population", "Population", "Population", "Populat…
$ value               <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 41…
$ geometry            <POLYGON [m]> POLYGON ((-138282.2 13643.2..., POLYGON ((…

We can also do the same for the clipped block-group data used in intermediate parts of the interpolation process above.

census_data_interpolate_long <- census_data_interpolate %>% 
    pivot_longer(cols = !c(GEOID, NAME, water_system_name, 
                           areal_weight_factor, geometry), 
                 names_to = 'variable', 
                 values_to = 'value') %>% 
    relocate(geometry, .after = last_col())

6.5 Save Results

Once we’ve finished the computations and verified the outputs look reasonable, we can save the results to output files so they can be re-used and shared. The results can be saved in tabular (e.g., csv, excel) and/or spatial (e.g., shapefile, geopackage) formats, which may be helpful for different use cases. Note that you may need to think about exactly what variables to include in the output file(s) and how to format the output datasets (e.g., wide versus long format).

The files saved below are all available here.

The chunk of code below (which is hidden by default), just tests to see whether any of the datasets to be saved have been changed since the previous version was saved. In general this is probably not needed for a typical workflow and can be ignored for most use cases – it is just used here to make rendering of this document a little more efficient.

Code
# compute hash for datasets to be saved (i.e., a unique identifier for each dataset), and compare against previous versions

## define file that stores hash (unique identifier for dataset)
hash_file <- here('03_data_results',
                  '_dataset_hash.csv')

## compute hashes (unique identifier for datasets)
hash_current <- digest(object = water_system_demographics,
                       algo = 'md5')
hash_current_long <- digest(object = water_system_demographics_long,
                            algo = 'md5')
hash_interpolate <- digest(object = census_data_interpolate,
                           algo = 'md5')
hash_interpolate_long <- digest(object = census_data_interpolate_long, 
                                algo = 'md5')
hash_table_current <- tibble(
    dataset = c('water_system_demographics', 
                'water_system_demographics_long',
                'census_data_interpolate',
                'census_data_interpolate_long'),
    hash = c(hash_current, 
             hash_current_long,
             hash_interpolate,
             hash_interpolate_long))

## get the previous hashes from file (if it exists), else create a new file to store the hashes
if (file.exists(hash_file)) {
    hash_table_previous <- read_csv(file = hash_file)
} else {
    file.create(hash_file)
    hash_table_previous <- tibble(
        dataset = c('water_system_demographics', 
                    'water_system_demographics_long',
                    'census_data_interpolate',
                    'census_data_interpolate_long'),
        hash = c('missing', 
                 'missing',
                 'missing', 
                 'missing'))
}

## if new hash is different from previous hash, set flag to update the output file (i.e., write a new version of the file)
file_update <- !identical(hash_table_current %>% 
                              filter(dataset == 'water_system_demographics') %>% 
                              pull(hash),
                          hash_table_previous %>% 
                              filter(dataset == 'water_system_demographics') %>% 
                              pull(hash))
file_update_long <- !identical(hash_table_current %>% 
                                   filter(dataset == 'water_system_demographics_long') %>% 
                                   pull(hash),
                               hash_table_previous %>% 
                                   filter(dataset == 'water_system_demographics_long') %>% 
                                   pull(hash))
file_update_interpolate <- !identical(hash_table_current %>% 
                                          filter(dataset == 'census_data_interpolate') %>% 
                                          pull(hash),
                                      hash_table_previous %>% 
                                          filter(dataset == 'census_data_interpolate') %>% 
                                          pull(hash))
file_update_interpolate_long <- !identical(hash_table_current %>% 
                                               filter(dataset == 'census_data_interpolate_long') %>% 
                                               pull(hash),
                                           hash_table_previous %>% 
                                               filter(dataset == 'census_data_interpolate_long') %>% 
                                               pull(hash))

## write current hashes to file (for comparison with future versions)
write_csv(x = hash_table_current,
          file = hash_file,
          append = FALSE)

6.5.1 Tabular Dataset

The code below saves the tabular results to a csv file, in both the ‘wide’ and ‘long’ formats. The wide format data can also be viewed here, or downloaded with this link. The long format data can be viewed here, or downloaded with this link.

# wide
if (file_update == TRUE) {
    write_csv(water_system_demographics %>%
                  st_drop_geometry(), # drop spatial data 
              file = here('03_data_results',
                          'water_system_demographics_sac.csv'))
}

# long
if (file_update_long == TRUE) {
    write_csv(water_system_demographics_long %>%
                  st_drop_geometry(), # drop spatial data
              file = here('03_data_results',
                          'water_system_demographics_sac_long.csv'))
}

And we can save the intermediate data from the interpolation process (i.e., data for clipped block groups) in wide and long format – these files can be downloaded with this link and this link respectively.

# wide
if (file_update_interpolate == TRUE) {
    write_csv(census_data_interpolate %>%
                  st_drop_geometry(), # drop the spatial data 
              file = here('03_data_results',
                          'intermediate_interpolation_data.csv'))
}

# long
if (file_update_interpolate_long == TRUE) {
    write_csv(census_data_interpolate_long %>%
                  st_drop_geometry(), # drop the spatial data 
              file = here('03_data_results',
                          'intermediate_interpolation_data_long.csv'))
}

6.5.2 Spatial Dataset

To save the output in a geospatial format, it may be best to save the data in a wide format, so that all of the attribute data for each target area (water system) is in a single row along with its spatial data (i.e. the system boundary information) (saving in long format may create a very large file). The code below saves the results – in wide format – to a geopackage file, which is a spatial file format that is similar to a shapefile. The final water system demographic data is available to downloaded with this link, and the data from the intermediate calculations (for clipped block groups) is available to download with this link.

if (file_update == TRUE) {
    st_write(water_system_demographics,
             here('03_data_results',
                  'water_system_demographics_sac.gpkg'),
             append = FALSE)
}

if (file_update_interpolate == TRUE) {
    st_write(census_data_interpolate,
             here('03_data_results',
                  'intermediate_interpolation_data.gpkg'),
             append = FALSE)
}

7 Explore and Visualize Results

Warning

This section is in progress.

[TODO: Insert Shiny App (iframe)]

For simplicity, this section will focus on presenting estimated demographics for some of the largest water suppliers in the Sacramento county region (results for small water systems may not be very accurate and should be used with some caution - see Section 8 and Section 10 for more investigation of the results for small systems).

# Select systems to plot

## number of systems
n_systems <- 20

## get list of selected systems
systems_top_n <- water_system_demographics %>% 
    slice_max(population_total_count, n = n_systems) %>% 
    pull(water_system_name)

7.1 Race / Ethnicity

[placeholder]

  • percent by group (bar)

  • dot-density

  • mapview (non-white)

7.2 Income Distributions

[placeholder]

  • income brackets (50k) (bar)

  • median household income by block group (dots)

  • dot-density (below threshold value)

  • mapview

7.3 Poverty Rates

[placeholder]

  • dot plot

  • mapview

  • side-by-side map

7.4 Income & Relative Housing Costs

The biscale R package (Prener, Grossenbacher, and Zehr 2022) can be used to create maps that show how two metrics vary together spatially (bivariate choropleth maps).

Figure 5 shows the relationship between estimated income and relative housing costs for the top 20 systems by estimated population in Sacramento County.

Code
# Table B25140 - Housing Costs as a Percentage of Household Income in the past 12 months.
# Shows the count of households paying more than 30% or 50% of their income towards housing 
# costs broken out by three tenure categories (owned with a mortgage, owned without a mortgage, and rented).

# set defaults
biscale_pal <- 'BlueOr' # 'GrPink' # 'DkViolet2'
biscale_dim <- 3

# create classes
biscale_data <- bi_class(water_system_demographics %>% 
                             filter(water_system_name %in% systems_top_n) %>% 
                             filter(!is.na(median_household_income_hh_weighted)), 
                         x = households_all_housing_costs_over30pct_percent, 
                         y = median_household_income_hh_weighted, 
                         style = "quantile", 
                         dim = biscale_dim)

# create map
biscale_map <- ggplot() +
    geom_sf(data = biscale_data, 
            mapping = aes(fill = bi_class), 
            color = "white", 
            size = 0.1, 
            show.legend = FALSE) +
    bi_scale_fill(pal = biscale_pal, 
                  dim = biscale_dim) + 
    labs(
        title = "Estimated % of Households Paying More Than 30% of Income Towards Housing Costs \nand Estimated Median Household Income in Sacramento Water Systems",
        subtitle = glue("Top {n_systems} Water Systems by Population"),
        caption = glue("Data estimated from {acs_year} 5-year ACS Block Groups")
        # title = "Estimated Housing Cost as % of Household Income and \nEstimated Median Household Income in Sacramento Water Systems", 
        # caption = "% Housing cost shows the percent of households paying more than 30% of their income towards housing costs \nIncome shows median household income (yellow = missing)"
    ) +
    #   labs(
    #   title = "Housing Cost<sup>1</sup> and Income<sup>2</sup> in Sacramento Water Systems",
    #   caption = "<sup>1</sup>% of households paying more than 30% of their income towards housing costs<br><sup>2</sup>Median household income (yellow = missing)",
    #   subtitle = glue("Top {n_systems} systems by population")
    # ) +
    # add missing polygons back in
    geom_sf(data = water_system_demographics %>% 
                filter(water_system_name %in% systems_top_n) %>% 
                filter(is.na(median_household_income_hh_weighted)),
            color = "white",
            fill = 'gold'
    ) +
    geom_sf(data = counties_ca %>% filter(NAME == 'Sacramento'), 
            color = 'grey',
            fill = NA) +
    bi_theme() + 
    theme(plot.title = element_text(size=12), # element_markdown(size=12)
          plot.subtitle = element_text(size=10),
          plot.caption = element_text(size=8, hjust = 1)) # element_markdown(size=8, hjust = 1))

# create legend
biscale_legend <- bi_legend(pal = biscale_pal,
                            dim = biscale_dim,
                            xlab = "% Housing Costs ",
                            ylab = "Income ",
                            size = 8)

# construct map
biscale_plot <- ggdraw() +
    draw_plot(biscale_map, 0, 0, 1, 1) +
    draw_plot(biscale_legend, 0.1, .65, 0.2, 0.2)

biscale_plot
Figure 5

Figure 6 shows the same variables (relative housing costs and income) for the portions block groups overlapping Sacramento Suburban Water District – this illustrates the data underlying the interpolation process.

Code
# set defaults
biscale_pal_system <- 'BlueOr' # 'GrPink' # 'DkViolet2'
biscale_dim_system <- 3

# create classes
biscale_data_system <- bi_class(census_data_interpolate %>% 
                                    filter(water_system_name == system_plot) %>% 
                                    filter(!is.na(median_household_income)), 
                                x = households_all_housing_costs_over30pct_percent, 
                                y = median_household_income, 
                                style = "quantile", 
                                dim = biscale_dim_system)
# create map
biscale_map_system  <- ggplot() +
    geom_sf(data = biscale_data_system , 
            mapping = aes(fill = bi_class), 
            color = "white", 
            size = 0.1, 
            show.legend = FALSE) +
    bi_scale_fill(pal = biscale_pal_system, 
                  dim = biscale_dim_system) + 
    labs(
        title = glue("Estimated % of Households Paying More Than 30% of Income Towards Housing Costs \nand Estimated Median Household Income in {str_to_title(system_plot)}"),
        # subtitle = glue(""),
        caption = glue("Data from {acs_year} 5-year ACS Block Groups (Yellow = Missing Data)")#,
        # title = glue("Housing Cost and Income \nin {str_to_title(system_plot)}"), 
        # caption = "% Housing cost shows the percent of households paying more than 30% of their income towards housing costs \nIncome shows median household income (yellow = missing)"#,
    ) +
    # add the missing polygons back in
    geom_sf(data = census_data_interpolate %>% 
                filter(water_system_name == system_plot) %>% 
                filter(is.na(median_household_income)),
            color = "white",
            fill = 'gold'
    ) +
    bi_theme() + 
    theme(plot.title = element_text(size=12), # element_markdown(size=12)
          plot.subtitle = element_text(size=10),
          plot.caption = element_text(size=8, hjust = 1)) # element_markdown(size=8, hjust = 1))

# create legend
biscale_legend <- bi_legend(pal = biscale_pal_system,
                            dim = biscale_dim_system,
                            xlab = "% Housing Costs ",
                            ylab = "Income ",
                            size = 8)

# construct map
biscale_plot_system <- ggdraw() +
    draw_plot(biscale_map_system, 0, 0, 1, 1) +
    draw_plot(biscale_legend, 0.1, .55, 0.2, 0.2)

biscale_plot_system
Figure 6

8 Check - Estimated vs Reported Population Estimates

[TO DO: Create map]

Based on the map above, it’s apparent that it will be difficult to obtain reasonable estimates for some suppliers, such as the suppliers with very small service areas in the southern portion of the county where the block groups are very large (and the supplier’s service are is only a small fraction of the total area of the block group). These issues are explored further in Section 10.

Note that there are a number of reasons why the estimated population values are likely to differ from the population numbers in the water system dataset (e.g., the depicted boundaries may not be correct or exact, the supplier may have used different methods to count/estimate the population they serve, the time frames for the estimates may be different, etc.). But, there may also be some cases where the numbers differ significantly – depending on the actual analysis being performed, this may mean that further work is needed for certain areas, or could mean that this method may not be sufficient and different methods are needed.

As a check, we can add a column to the interpolated dataset (which we’ll call population_percent_difference) to compute the difference between the estimated total population (in the population_total field) and the total population listed in the water_system_population_reported field (which is the reported value from the water system dataset).

water_system_demographics_check <- water_system_demographics %>% 
    left_join(water_systems_sac %>% 
                  st_drop_geometry() %>% 
                  select(water_system_name, water_system_population_reported,
                         water_system_service_connections),
              by = 'water_system_name')

water_system_demographics_check <- water_system_demographics_check %>%
    mutate(population_percent_difference =
               round(100 * (population_total_count - water_system_population_reported) / 
                         water_system_population_reported, 
                     2), 
           .after = water_system_population_reported)

For larger water systems, the estimated population values seem to be roughly in line with the population numbers in the original dataset– you can see this in the upper rows of Table 2.

pct_format <- label_percent(accuracy = 0.01)

water_system_demographics_check %>%
    st_drop_geometry() %>%
    arrange(desc(water_system_population_reported)) %>%
    select(water_system_name, 
           water_system_service_connections,
           water_system_population_reported, 
           population_total_count,
           population_percent_difference,
           ) %>%
    mutate(population_percent_difference = pct_format(
        population_percent_difference / 100)) %>%
    rename('Water System Name' = water_system_name, 
           'Service Connections' = water_system_service_connections,
           'Estimated Population' = population_total_count,
           'Reported Population' = water_system_population_reported,
           'Percent Difference' = population_percent_difference,
           ) %>%
    kable(align = 'c', 
          format.args = list(big.mark = ',')
          ) %>%
    scroll_box(height = "400px")
Table 2: Water Systems Sorted by Reported Population (Largest to Smallest)
Water System Name Service Connections Reported Population Estimated Population Percent Difference
CITY OF SACRAMENTO MAIN 142,794 510,931 516,189 1.03%
SACRAMENTO SUBURBAN WATER DISTRICT 46,573 184,385 193,126 4.74%
SCWA - LAGUNA/VINEYARD 47,411 172,666 145,495 -15.74%
FOLSOM, CITY OF - MAIN 21,424 68,122 62,462 -8.31%
CITRUS HEIGHTS WATER DISTRICT 19,940 65,911 68,912 4.55%
CALAM - SUBURBAN ROSEMONT 16,238 53,563 57,897 8.09%
CALAM - PARKWAY 14,779 48,738 58,635 20.31%
CALAM - LINCOLN OAKS 14,390 47,487 42,916 -9.63%
GOLDEN STATE WATER CO. - CORDOVA 14,798 44,928 48,115 7.09%
ELK GROVE WATER SERVICE 12,882 42,540 42,647 0.25%
CARMICHAEL WATER DISTRICT 11,704 37,897 39,253 3.58%
FAIR OAKS WATER DISTRICT 14,293 35,114 36,003 2.53%
CALAM - ANTELOPE 10,528 34,720 33,120 -4.61%
SAN JUAN WATER DISTRICT 10,672 29,641 30,122 1.62%
GALT, CITY OF 7,471 26,536 21,490 -19.02%
SCWA MATHER-SUNRISE 6,921 22,839 18,249 -20.10%
ORANGE VALE WATER COMPANY 5,684 18,005 17,387 -3.43%
CAL AM FRUITRIDGE VISTA 4,667 15,385 22,603 46.92%
RIO LINDA/ELVERTA COMMUNITY WATER DIST 4,621 14,381 11,831 -17.73%
SCWA - ARDEN PARK VISTA 3,043 10,035 8,086 -19.42%
FOLSOM STATE PRISON 2,790 9,703 3,536 -63.56%
FLORIN COUNTY WATER DISTRICT 2,323 7,831 9,951 27.07%
RANCHO MURIETA COMMUNITY SERVI 2,726 5,744 3,239 -43.61%
GOLDEN STATE WATER CO - ARDEN WATER SERV 1,716 5,125 6,556 27.92%
DEL PASO MANOR COUNTY WATER DI 1,796 4,520 5,592 23.72%
CALAM - ARDEN 1,185 3,908 10,112 158.75%
FOLSOM, CITY OF - ASHLAND 1,079 3,538 3,845 8.68%
RIO COSUMNES CORRECTIONAL CENTER [SWS] 13 2,800 22 -99.21%
CALAM - ISLETON 480 1,581 34 -97.85%
MC CLELLAN MHP 199 700 269 -61.57%
CALAM - WALNUT GROVE 197 651 12 -98.16%
CALIFORNIA STATE FAIR 269 650 532 -18.15%
TOKAY PARK WATER CO 198 525 652 24.19%
LAGUNA DEL SOL INC 112 470 24 -94.89%
OLYMPIA MOBILODGE 200 450 290 -35.56%
SAC CITY MOBILE HOME COMMUNITY LP 164 350 229 -34.57%
EAST WALNUT GROVE [SWS] 166 300 3 -99.00%
ELEVEN OAKS MOBILE HOME COMMUNITY 136 262 233 -11.07%
EL DORADO MOBILE HOME PARK 128 256 139 -45.70%
RANCHO MARINA 77 250 0 -100.00%
HOLIDAY MOBILE VILLAGE 115 200 46 -77.00%
IMPERIAL MANOR MOBILEHOME COMMUNITY 186 200 209 4.50%
EL DORADO WEST MHP 128 172 148 -13.95%
KORTHS PIRATES LAIR 64 150 0 -100.00%
RIVER'S EDGE MARINA & RESORT 83 150 0 -100.00%
SOUTHWEST TRACT W M D [SWS] 33 150 174 16.00%
VIEIRA'S RESORT, INC 107 150 4 -97.33%
B & W RESORT MARINA 37 100 0 -100.00%
HOOD WATER MAINTENCE DIST [SWS] 82 100 1 -99.00%
SPINDRIFT MARINA 50 100 0 -100.00%
LOCKE WATER WORKS CO [SWS] 44 80 1 -98.75%
WESTERNER MOBILE HOME PARK 49 65 32 -50.77%
HAPPY HARBOR (SWS) 45 60 0 -100.00%
SEQUOIA WATER ASSOC 18 54 0 -100.00%
PLANTATION MOBILE HOME PARK 44 44 10 -77.27%
TUNNEL TRAILER PARK 21 44 0 -100.00%
FREEPORT MARINA 27 42 3 -92.86%
EDGEWATER MOBILE HOME PARK 22 40 0 -100.00%
MAGNOLIA MUTUAL WATER 34 40 1 -97.50%
LINCOLN CHAN-HOME RANCH 19 33 4 -87.88%
LAGUNA VILLAGE RV PARK 28 32 20 -37.50%
DELTA CROSSING MHP 22 30 0 -100.00%

But for water systems with a small population and/or service area, the estimated demographics may not match the reported population numbers from the water system dataset very well – you can see this in the top rows of Table 3. This probably indicates that, for small areas, some adjustments and/or further analysis may be needed, and the preliminary estimated values should be treated with some caution/skepticism.

Note: See Section 10 below for some more investigation into water systems whose estimated population is at or near zero.

pct_format <- label_percent(accuracy = 0.01)

water_system_demographics_check %>%
    st_drop_geometry() %>%
    arrange(water_system_population_reported) %>%
    select(water_system_name, 
           water_system_service_connections,
           water_system_population_reported, 
           population_total_count,
           population_percent_difference,
           ) %>%
    mutate(population_percent_difference = pct_format(population_percent_difference / 100)) %>%
    rename('Water System Name' = water_system_name, 
           'Service Connections' = water_system_service_connections,
           'Estimated Population' = population_total_count,
           'Reported Population' = water_system_population_reported,
           'Percent Difference' = population_percent_difference,
           ) %>%
    kable(align = 'c', 
          format.args = list(big.mark = ',')
          ) %>%
    scroll_box(height = "400px")
Table 3: Water Systems Sorted by Reported Population (Smallest to Largest)
Water System Name Service Connections Reported Population Estimated Population Percent Difference
DELTA CROSSING MHP 22 30 0 -100.00%
LAGUNA VILLAGE RV PARK 28 32 20 -37.50%
LINCOLN CHAN-HOME RANCH 19 33 4 -87.88%
EDGEWATER MOBILE HOME PARK 22 40 0 -100.00%
MAGNOLIA MUTUAL WATER 34 40 1 -97.50%
FREEPORT MARINA 27 42 3 -92.86%
PLANTATION MOBILE HOME PARK 44 44 10 -77.27%
TUNNEL TRAILER PARK 21 44 0 -100.00%
SEQUOIA WATER ASSOC 18 54 0 -100.00%
HAPPY HARBOR (SWS) 45 60 0 -100.00%
WESTERNER MOBILE HOME PARK 49 65 32 -50.77%
LOCKE WATER WORKS CO [SWS] 44 80 1 -98.75%
B & W RESORT MARINA 37 100 0 -100.00%
HOOD WATER MAINTENCE DIST [SWS] 82 100 1 -99.00%
SPINDRIFT MARINA 50 100 0 -100.00%
KORTHS PIRATES LAIR 64 150 0 -100.00%
RIVER'S EDGE MARINA & RESORT 83 150 0 -100.00%
SOUTHWEST TRACT W M D [SWS] 33 150 174 16.00%
VIEIRA'S RESORT, INC 107 150 4 -97.33%
EL DORADO WEST MHP 128 172 148 -13.95%
HOLIDAY MOBILE VILLAGE 115 200 46 -77.00%
IMPERIAL MANOR MOBILEHOME COMMUNITY 186 200 209 4.50%
RANCHO MARINA 77 250 0 -100.00%
EL DORADO MOBILE HOME PARK 128 256 139 -45.70%
ELEVEN OAKS MOBILE HOME COMMUNITY 136 262 233 -11.07%
EAST WALNUT GROVE [SWS] 166 300 3 -99.00%
SAC CITY MOBILE HOME COMMUNITY LP 164 350 229 -34.57%
OLYMPIA MOBILODGE 200 450 290 -35.56%
LAGUNA DEL SOL INC 112 470 24 -94.89%
TOKAY PARK WATER CO 198 525 652 24.19%
CALIFORNIA STATE FAIR 269 650 532 -18.15%
CALAM - WALNUT GROVE 197 651 12 -98.16%
MC CLELLAN MHP 199 700 269 -61.57%
CALAM - ISLETON 480 1,581 34 -97.85%
RIO COSUMNES CORRECTIONAL CENTER [SWS] 13 2,800 22 -99.21%
FOLSOM, CITY OF - ASHLAND 1,079 3,538 3,845 8.68%
CALAM - ARDEN 1,185 3,908 10,112 158.75%
DEL PASO MANOR COUNTY WATER DI 1,796 4,520 5,592 23.72%
GOLDEN STATE WATER CO - ARDEN WATER SERV 1,716 5,125 6,556 27.92%
RANCHO MURIETA COMMUNITY SERVI 2,726 5,744 3,239 -43.61%
FLORIN COUNTY WATER DISTRICT 2,323 7,831 9,951 27.07%
FOLSOM STATE PRISON 2,790 9,703 3,536 -63.56%
SCWA - ARDEN PARK VISTA 3,043 10,035 8,086 -19.42%
RIO LINDA/ELVERTA COMMUNITY WATER DIST 4,621 14,381 11,831 -17.73%
CAL AM FRUITRIDGE VISTA 4,667 15,385 22,603 46.92%
ORANGE VALE WATER COMPANY 5,684 18,005 17,387 -3.43%
SCWA MATHER-SUNRISE 6,921 22,839 18,249 -20.10%
GALT, CITY OF 7,471 26,536 21,490 -19.02%
SAN JUAN WATER DISTRICT 10,672 29,641 30,122 1.62%
CALAM - ANTELOPE 10,528 34,720 33,120 -4.61%
FAIR OAKS WATER DISTRICT 14,293 35,114 36,003 2.53%
CARMICHAEL WATER DISTRICT 11,704 37,897 39,253 3.58%
ELK GROVE WATER SERVICE 12,882 42,540 42,647 0.25%
GOLDEN STATE WATER CO. - CORDOVA 14,798 44,928 48,115 7.09%
CALAM - LINCOLN OAKS 14,390 47,487 42,916 -9.63%
CALAM - PARKWAY 14,779 48,738 58,635 20.31%
CALAM - SUBURBAN ROSEMONT 16,238 53,563 57,897 8.09%
CITRUS HEIGHTS WATER DISTRICT 19,940 65,911 68,912 4.55%
FOLSOM, CITY OF - MAIN 21,424 68,122 62,462 -8.31%
SCWA - LAGUNA/VINEYARD 47,411 172,666 145,495 -15.74%
SACRAMENTO SUBURBAN WATER DISTRICT 46,573 184,385 193,126 4.74%
CITY OF SACRAMENTO MAIN 142,794 510,931 516,189 1.03%

9 Considerations for Detailed Population Estimates

Warning

This section is in progress.

If you’re primarily only interested in population estimates (possibly including population by race/ethnicity, age, gender, etc.) and need an estimate that’s as geographically accurate as possible, it may make more sense to use the block-level population data from the decennial census rather than block group level population data from the ACS. However, since the decennial census only occurs once every 10 years, those estimates won’t reflect recent population changes (and will get especially less accurate as we get farther from the last decennial census). But keep in mind that even the 5-year ACS is an average that encompasses previous years’ estimates, so it’s not necessarily temporally precise either.

It’s also possible to use the block-level decennial population data as a weighing factor for ACS population data (to allocate the population within block-group level ACS data).

[TO DO: add example]

10 Considerations for Small / Rural Area Estimates

Warning

This section is in progress.

For some water systems, the estimated population using the areal interpolation above (Section 6.2) was at or near zero, and it may be useful to look at an example to see what’s going on with one of those cases.

(because the water system may encompass only a small portion of one or a few census units, and the entire census unit(s) may not be representative of the small portion(s)), especially those in rural environments (where population densities are lower, population centers tend to be spread out, and census units tend to be larger).

[TO DO: insert map]

From the map above [TO DO: insert map], you can see that the service area reported for some systems are very small, only covering a small fraction of a single census unit, resulting in a population estimate that is very low. In these cases, it could be that the system area was drawn incorrectly (i.e., maybe it doesn’t really depict the entire service area), in which case the reported service area should be revised. Or, it’s possible that the population within the given census unit is very un-evenly distributed and instead there’s a relatively high density population cluster in the depicted service area, in which case a more sophisticated method than an area-weighted average should be used (e.g., maybe consider the density of buildings, roads, and/or other features associated with inhabited areas).

11 Alternative Computation Methods

Warning

This section is in progress.

11.1 Simplified Method With MOE Estimates

As noted above, determining the margin of error (MOE) for estimates computed using areal weighted interpolation to aggregate portions of census units that overlap the target area of interest may not be possible (more research may be needed). If it’s necessary to compute MOEs for your aggregated values, and/or it’s preferable to use a simpler approach that doesn’t apply areal interpolation to assign fractional portions of census units to the target area, then a simplified method could be applied.

Tip

For guidance on how calculate MOEs for some types of derived estimates, see this document.

In this case, one option could be to use a minimum coverage threshold, where entire census units whose portion of area that overlaps the target area is greater than the threshold are treated as part of the target area, and any census units whose portion of area that overlaps the target area is less than the threshold are not treated as part of the target area. But, when using a minimum coverage threshold, some water systems may not have any census units that meet the coverage threshold, so they may need to be accountetd for separately (e.g., by selecting the overlapping census unit that has the greatest portion of overlap, as is done below), or those systems could be excluded from the calculation.

Because this approach operates on entire census units, the census bureau’s recommended approach for aggregating MOEs can be applied to produce an aggregated MOE. (However, keep in mind that the aggregated MOE applies to the uncertainty in the estimate for the census units included in the aggregation, and not may not necessarily capture the uncertainty in the estimate of the target area, since the two areas are now different – i.e., there is an additional unquantified element of uncertainty/error which is not reflected in the MOE due to this mismatch. In general, any estimate which attempts to compute census demographics for areas that don’t align with the census boundaries may have some element of un-quantifiable error – more research/input may be needed.)

Tip

tidycensus has functions for calculating MOEs for derived estimates based on Census-supplied formulas, including moe_sum(), moe_product(), moe_ratio(), and moe_prop().

Here’s an example calculation:

# define threshold value
overlap_threshold <- 0.5

# get census data (with MOEs)
census_data_acs_moe <- get_acs(geography = 'block group',
                               state = 'CA', 
                               county = counties_list,
                               filter_by = water_systems_filter,
                               year = acs_year,
                               survey = 'acs5',
                               variables = census_vars_acs, 
                               output = 'wide', # can be 'wide' or 'tidy'
                               geometry = TRUE,
                               cache_table = TRUE) %>% 
    st_transform(crs_projected) # convert to common coordinate system

# compute area of overlap for each census unit
simplified_calc <- census_data_acs_moe %>%
    mutate(census_unit_area = st_area(.)) %>% 
    st_intersection(water_systems_sac %>% 
                        select(water_system_name)) %>%
    mutate(clipped_area = st_area(.)) %>% 
    mutate(overlap_portion = drop_units(clipped_area / census_unit_area)) %>% 
    mutate(geoid_system = paste(GEOID, water_system_name, sep = '|')) %>% 
    st_drop_geometry()

# determine which census units to include, based on threshold value
simplified_calc <- simplified_calc %>% 
    mutate(above_threshold = overlap_portion >= overlap_threshold)

# account for water systems with no census units that meet the threshold value
## get list of systems with at least 1 census unit above threshold
simplified_systems_above_threshold <- simplified_calc %>% 
    filter(above_threshold == TRUE) %>% 
    pull(water_system_name) %>% 
    unique()
## get list of systems with no census units above threshold    
simplified_systems_below_threshold <- water_systems_sac %>% 
    filter(!water_system_name %in% simplified_systems_above_threshold) %>% 
    pull(water_system_name)

## select the 1 census unit per system with the greatest overlap 
simplified_systems_below_threshold_keep <- simplified_calc %>% 
    filter(water_system_name %in% simplified_systems_below_threshold) %>% 
    group_by(water_system_name) %>%
    slice_max(order_by = overlap_portion, n = 1) %>%
    ungroup()

# filter census units based on threshold value (and account for water systems 
# with no census units that meet the threshold value)
geoid_system_keep_above_threshold <- simplified_calc %>% 
    filter(above_threshold == TRUE) %>% 
    pull(geoid_system)
geoid_system_keep_below_threshold <- simplified_systems_below_threshold_keep %>% 
    pull(geoid_system)

census_data_acs_moe <- census_data_acs_moe %>% 
    st_join(water_systems_sac %>% select(water_system_name)) %>% 
    mutate(geoid_system = paste(GEOID, water_system_name, sep = '|')) %>% 
    filter(geoid_system %in% c(geoid_system_keep_above_threshold, 
                               geoid_system_keep_below_threshold))

# aggregate
# [TO DO: compute aggregated values]
# simplified_calc_aggregate <- census_data_acs_moe %>% 
#     group_by(water_system_name, water_systems_filter) %>% 
#     {.} %>% 
#     ungroup()

# compute MOEs
# [TO DO - use tidycensus functions to calculate MOEs]

Figure 7 shows the census units used in this simplified method to estimate demographics for Sacramento Suburban Water District.

Code
mapview(census_data_acs_moe %>% 
            filter(water_system_name == system_plot), 
        alpha.regions = 0.8, 
        col.regions = 'grey60',
        color = 'cyan',
        # lwd = 1.3, 
        label = 'NAME',  
        layer.name = 'ACS Data', 
        legend = FALSE) + #  zcol = 'NAME'    
    mapview(water_systems_sac %>% 
                filter(water_system_name == system_plot), 
            alpha.regions = 0.3, 
            col.regions = 'darkblue',
            color = 'black',
            lwd = 1.3, 
            zcol = 'water_system_name',
            # label = 'water_system_name',
            layer.name = 'Water System Boundary', 
            legend = FALSE)
Figure 7: Water system Sacramento Suburban Water District (light blue fill / black border) and boundaries of census units (grey fill / blue border) used to estimate water system demographics for the simplified approach.

While this approach may work well for relatively large water systems (where the size of the system is significantly greater than the census units used for the analysis), for smaller water systems this method might be somewhat more problematic, as shown in Figure 8.

Code
system_plot_small <- 'RIO LINDA/ELVERTA COMMUNITY WATER DIST'

mapview(census_data_acs_moe %>% 
            filter(water_system_name == system_plot_small), 
        alpha.regions = 0.8, 
        col.regions = 'grey60',
        color = 'cyan',
        # lwd = 1.3, 
        label = 'NAME',  
        layer.name = 'ACS Data', 
        legend = FALSE) + #  zcol = 'NAME'    
    mapview(water_systems_sac %>% 
                filter(water_system_name == system_plot_small), 
            alpha.regions = 0.3, 
            col.regions = 'darkblue',
            color = 'black',
            lwd = 1.3, 
            zcol = 'water_system_name',
            # label = 'water_system_name',
            layer.name = 'Water System Boundary', 
            legend = FALSE)
Figure 8: Water system Rio Linda/Elverta Community Water Dist (light blue fill / black border) and boundaries of census units (grey fill / blue border) used to estimate water system demographics for the simplified approach.

Figure 9 shows another example of a small system – in this case there are large block groups which the water system only overlaps a small portion of.

Code
system_plot_small <- 'RANCHO MURIETA COMMUNITY SERVI'

mapview(census_data_acs %>% 
            st_filter(water_systems_sac %>% 
                          filter(water_system_name == system_plot_small)) %>% 
            filter(!GEOID %in% (census_data_acs_moe %>% 
                                    filter(water_system_name == system_plot_small) %>% 
                                    pull(GEOID))), 
        alpha.regions = 0.3, 
        col.regions = 'grey80',
        color = 'grey30',
        # lwd = 1.3, 
        label = 'NAME',  
        layer.name = 'ACS Data - Not Used', 
        legend = FALSE) + #  zcol = 'NAME' 
    mapview(census_data_acs_moe %>% 
                filter(water_system_name == system_plot_small), 
            alpha.regions = 0.8, 
            col.regions = 'grey60',
            color = 'cyan',
            # lwd = 1.3, 
            label = 'NAME',  
            layer.name = 'ACS Data - Used', 
            legend = FALSE) + #  zcol = 'NAME'    
    mapview(water_systems_sac %>% 
                filter(water_system_name == system_plot_small), 
            alpha.regions = 0.3, 
            col.regions = 'darkblue',
            color = 'black',
            lwd = 1.3, 
            zcol = 'water_system_name',
            # label = 'water_system_name',
            layer.name = 'Water System Boundary', 
            legend = FALSE)
Figure 9: Water system Rancho Murieta Community Servi (light blue fill / black border), boundaries of census units (dark grey fill / blue border) used to estimate water system demographics for the simplified approach, and boundaries of census units overlapping the water system but not included in the demographic estimates (light grey fill).

11.2 Population Weighted Interpolation

The tidycensus package also has a function for population weighted interpolation, interpolate_pw, but it uses a somewhat different methodology than the population weighted interpolation procedure applied above in Section 6.3.

Note that some water systems may not get an estimated value using this method, even if NA values are removed from the source data first.

water_system_demographics_interpolate_pw_count <- interpolate_pw(
    from = census_data_acs %>%
        filter(!is.na(population_total_count)) %>% 
        select(ends_with('_count')),
    to = water_systems_sac,
    to_id = 'water_system_name',
    extensive = TRUE, # use TRUE for count data
    weights = census_data_decennial,
    # weight_placement = 'surface',
    weight_column = 'population_total_count') %>%
    mutate(across(
        .cols = ends_with('_count'),
        .fns = ~ round(.x, 0)
    )) %>% 
    arrange(water_system_name)

This returns NAs for 17 water systems (it looks like those are small systems). Table 4 shows a comparison of the system populations estimated using interpolate_pw and the reported system populations:

pct_format <- label_percent(accuracy = 0.01)

water_system_demographics_interpolate_pw_count %>%
    select(water_system_name, population_total_count) %>% 
    st_drop_geometry() %>% 
    left_join(water_systems_sac %>%
                  st_drop_geometry() %>%
                  select(water_system_service_connections, 
                         water_system_population_reported, 
                         water_system_name),
              by = 'water_system_name') %>% 
    arrange(desc(water_system_population_reported)) %>% 
    relocate(water_system_service_connections, water_system_population_reported, 
             .before = population_total_count) %>% 
    mutate(population_percent_difference =
               round(100 * (population_total_count - water_system_population_reported) / 
                         water_system_population_reported, 
                     2), 
           .after = population_total_count) %>% 
    mutate(population_percent_difference = pct_format(
        population_percent_difference / 100)
    ) %>%
    rename('Service Connections' = water_system_service_connections,
           'Reported Population' = water_system_population_reported,
           'Estimated Population' = population_total_count,
           'Percent Difference' = population_percent_difference) %>% 
    kable(align = 'c', 
          format.args = list(big.mark = ',')) %>%
    scroll_box(height = "400px")
Table 4: Results Comparison - estimated population with interpolate_pw() vs. reported population (Sorted Largest to Smallest by Reported Population)
water_system_name Service Connections Reported Population Estimated Population Percent Difference
CITY OF SACRAMENTO MAIN 142,794 510,931 525,914 2.93%
SACRAMENTO SUBURBAN WATER DISTRICT 46,573 184,385 190,956 3.56%
SCWA - LAGUNA/VINEYARD 47,411 172,666 157,782 -8.62%
FOLSOM, CITY OF - MAIN 21,424 68,122 65,206 -4.28%
CITRUS HEIGHTS WATER DISTRICT 19,940 65,911 69,931 6.10%
CALAM - SUBURBAN ROSEMONT 16,238 53,563 60,288 12.56%
CALAM - PARKWAY 14,779 48,738 57,391 17.75%
CALAM - LINCOLN OAKS 14,390 47,487 44,168 -6.99%
GOLDEN STATE WATER CO. - CORDOVA 14,798 44,928 48,645 8.27%
ELK GROVE WATER SERVICE 12,882 42,540 42,652 0.26%
CARMICHAEL WATER DISTRICT 11,704 37,897 39,773 4.95%
FAIR OAKS WATER DISTRICT 14,293 35,114 38,819 10.55%
CALAM - ANTELOPE 10,528 34,720 36,641 5.53%
SAN JUAN WATER DISTRICT 10,672 29,641 30,997 4.57%
GALT, CITY OF 7,471 26,536 26,962 1.61%
SCWA MATHER-SUNRISE 6,921 22,839 19,629 -14.05%
ORANGE VALE WATER COMPANY 5,684 18,005 17,910 -0.53%
CAL AM FRUITRIDGE VISTA 4,667 15,385 21,116 37.25%
RIO LINDA/ELVERTA COMMUNITY WATER DIST 4,621 14,381 15,102 5.01%
SCWA - ARDEN PARK VISTA 3,043 10,035 9,617 -4.17%
FOLSOM STATE PRISON 2,790 9,703 32 -99.67%
FLORIN COUNTY WATER DISTRICT 2,323 7,831 11,114 41.92%
RANCHO MURIETA COMMUNITY SERVI 2,726 5,744 4,853 -15.51%
GOLDEN STATE WATER CO - ARDEN WATER SERV 1,716 5,125 6,516 27.14%
DEL PASO MANOR COUNTY WATER DI 1,796 4,520 5,784 27.96%
CALAM - ARDEN 1,185 3,908 11,512 194.58%
FOLSOM, CITY OF - ASHLAND 1,079 3,538 3,719 5.12%
RIO COSUMNES CORRECTIONAL CENTER [SWS] 13 2,800 NA NA
CALAM - ISLETON 480 1,581 519 -67.17%
MC CLELLAN MHP 199 700 412 -41.14%
CALAM - WALNUT GROVE 197 651 388 -40.40%
CALIFORNIA STATE FAIR 269 650 NA NA
TOKAY PARK WATER CO 198 525 530 0.95%
LAGUNA DEL SOL INC 112 470 NA NA
OLYMPIA MOBILODGE 200 450 176 -60.89%
SAC CITY MOBILE HOME COMMUNITY LP 164 350 NA NA
EAST WALNUT GROVE [SWS] 166 300 347 15.67%
ELEVEN OAKS MOBILE HOME COMMUNITY 136 262 368 40.46%
EL DORADO MOBILE HOME PARK 128 256 1,031 302.73%
RANCHO MARINA 77 250 NA NA
HOLIDAY MOBILE VILLAGE 115 200 NA NA
IMPERIAL MANOR MOBILEHOME COMMUNITY 186 200 242 21.00%
EL DORADO WEST MHP 128 172 227 31.98%
KORTHS PIRATES LAIR 64 150 NA NA
RIVER'S EDGE MARINA & RESORT 83 150 NA NA
SOUTHWEST TRACT W M D [SWS] 33 150 183 22.00%
VIEIRA'S RESORT, INC 107 150 67 -55.33%
B & W RESORT MARINA 37 100 NA NA
HOOD WATER MAINTENCE DIST [SWS] 82 100 74 -26.00%
SPINDRIFT MARINA 50 100 13 -87.00%
LOCKE WATER WORKS CO [SWS] 44 80 76 -5.00%
WESTERNER MOBILE HOME PARK 49 65 20 -69.23%
HAPPY HARBOR (SWS) 45 60 NA NA
SEQUOIA WATER ASSOC 18 54 NA NA
PLANTATION MOBILE HOME PARK 44 44 NA NA
TUNNEL TRAILER PARK 21 44 NA NA
FREEPORT MARINA 27 42 105 150.00%
EDGEWATER MOBILE HOME PARK 22 40 NA NA
MAGNOLIA MUTUAL WATER 34 40 96 140.00%
LINCOLN CHAN-HOME RANCH 19 33 NA NA
LAGUNA VILLAGE RV PARK 28 32 NA NA
DELTA CROSSING MHP 22 30 NA NA

12 Working with Other Source Datasets

Warning

This section is in progress.

In addition to using census data, it’s possible to use other types of source datasets to compute characteristics of custom target areas like water systems. The process is generally likely to be similar to the processes shown above using census data, but each source dataset may require unique considerations (e.g., to handle missing values, uncertain boundaries, etc.).

12.1 CalEnviroScreen

[TO DO: example computation of weighted average CES scores]

Notes to consider:

  • Some census tracts are missing CES scores (overall and/or for certain indicators), and have to deal with those missing values somehow

  • CES 4.0 is tract-level data, and uses 2010 census boundaries (so boundaries won’t match current ACS or decennial boundaries)

  • CES 4.0 boundaries are simplified, and boundaries between tracts are not consistent – for some types of analysis (especially when looking at point data - e.g., facilities), it may be better to use the original TIGER dataset (available from either the tidycensus or tigris R packages)


References

Parry, Josiah. 2023. “Arcgislayers: An r Interface for ArcGIS REST Services.”
Pebesma, Edzer, and Roger Bivand. 2023. Spatial Data Science: With Applications in r.” https://doi.org/10.1201/9780429459016.
Prener, Christopher, Timo Grossenbacher, and Angelo Zehr. 2022. “Biscale: Tools and Palettes for Bivariate Thematic Mapping.” https://CRAN.R-project.org/package=biscale.
Prener, Christopher, Revord, and Charles. 2019. areal: An R package for areal weighted interpolation.” Journal of Open Source Software 4 (37). https://doi.org/10.21105/joss.01221.
R Core Team. 2023. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Walker, Kyle. 2023a. “Tigris: Load Census TIGER/Line Shapefiles.” https://CRAN.R-project.org/package=tigris.
———. 2023b. “Analyzing US Census Data,” January. https://doi.org/10.1201/9780203711415.
Walker, Kyle, and Matt Herman. 2023. “Tidycensus: Load US Census Boundary and Attribute Data as ’Tidyverse’ and ’Sf’-Ready Data Frames.” https://CRAN.R-project.org/package=tidycensus.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the Tidyverse 4: 1686. https://doi.org/10.21105/joss.01686.